Abstract. The purpose of this paper is to estimate the intensity of some random
measure $N$ on a set $\mathcal{X}$ by a piecewise constant function on a finite partition of $\mathcal{X}$.
Given a (possibly large) family $\mathcal{M}$ of candidate partitions, we build a piecewise
constant estimator (histogram) on each of them and then use the data to select
one estimator in the family. Choosing the square of a Hellinger-type distance as
our loss function, we show that each estimator built on a given partition satisfies
an analogue of the classical squared bias plus variance risk bound. Moreover,
the selection procedure leads to a final estimator satisfying some oracle-type
inequality, with, as usual, a possible loss corresponding to the complexity of the
family $\mathcal{M}$. When this complexity is not too high, the selected estimator has a
risk bounded, up to a universal constant, by the smallest risk bound obtained for
the estimators in the family. For suitable choices of the family of partitions, we
deduce uniform risk bounds over various classes of intensities. Our approach
applies to the estimation of the intensity of an inhomogeneous Poisson process, among
other counting processes, or the estimation of the mean of a random vector with
nonnegative components.



1. Introduction 

The aim of the present paper is to design a new model selection procedure in 
a statistical framework which is general enough to cope simultaneously with the 
following estimation problems. 

Problem 1: Estimating the means of nonnegative data. The statistical
problem that initially motivated this research was suggested by Sylvie Huet and
corresponds to the modeling of data coming from some agricultural experiments. In
such an experiment, the observations are independent nonnegative random variables
$N_i$ with mean $s_i$, where $i$ varies among some finite index set $\mathcal{X}$. In this framework,
our aim is to estimate the vector $(s_i)_{i \in \mathcal{X}}$.

Problem 2: Estimating the intensity of a Poisson process. We recall that
a Poisson process $N$ on the measurable set $(\mathcal{X}, \mathcal{A})$ with finite mean measure $\nu$ is a
random measure $N$ on $\mathcal{X}$ such that

• for any $A \in \mathcal{A}$, $N(A)$ is a Poisson random variable with parameter $\nu(A)$;

• for any family $A_1, \ldots, A_n$ of disjoint elements of $\mathcal{A}$, the corresponding
random variables $N(A_1), \ldots, N(A_n)$ are independent.

We can always assume that $\nu$ is finite by suitably restricting the domain of
observation of the process. When the mean measure $\nu$ is dominated by some given
measure $\lambda$ on $\mathcal{X}$, the nonnegative function $s = d\nu/d\lambda$ is called the intensity of
$N$. A Poisson process can be represented as a point process on the set $\mathcal{X}$. Each
point represents the time (if $\mathcal{X} = \mathbb{R}_+$) or location of some event. For example, the
successive times of failures of some machine can be represented by a Poisson process
on $\mathcal{X} = \mathbb{R}_+$. The intensity of the process models the behaviour of the machine in
the following way: the time intervals on which the intensity takes large values
correspond to periods where failures are expected to be frequent and, conversely,
those on which the intensity is close to 0 are periods in which failures are rare. In
this statistical framework, our aim is to estimate the intensity $s$ on the basis of the
observation of $N$.

Problem 3: Estimating a hazard rate. We consider an $n$-sample $T_1, \ldots, T_n$
of nonnegative real-valued random variables with common density $p$ (with respect
to the Lebesgue measure on $\mathbb{R}_+$) and assume these to be (possibly) right-censored.
This means that there exist i.i.d. random variables $C_1, \ldots, C_n$ such that we actually
observe the pairs $X_j = (\tilde{T}_j, D_j)$ for $j = 1, \ldots, n$ with $\tilde{T}_j = \min\{T_j, C_j\}$ and
$D_j = 1_{\{T_j \le C_j\}}$. Such censored data are common in survival analysis. Typically,
$T_j$ corresponds to a time of failure or death which cannot be observed if it exceeds time
$C_j$. Our aim, here, is to estimate the hazard rate $s$ of the $T_j$, defined for $t > 0$ by
$s(t) = p(t)/\mathbb{P}(T_1 \ge t)$.

Problem 4: Estimating the intensity of the transition of a Markov process.
Let $\{X_t, t \ge 0\}$ be a Markov process on $\mathbb{R}_+$ with càdlàg paths and a finite number
of states. We distinguish two particular states, named 0 and 1, and assume that
0 is absorbing and that there is a positive probability to reach 1. Our aim is to
provide an estimation of the intensity of the transition time $T_{1,0}$ from state 1 to
0. Typical examples arise when 0 means "death", "failure", .... An alternative
example could be the situation where $T_{1,0}$ measures the age at which a drug addict
makes the transition from soft drugs (state 1) to hard drugs (state 0). In this case
we stop the chain at 0, making this state absorbing. For $t > 0$, we denote by $X_{t-}$
the left-hand limit of the process $X$ at time $t$ and assume that for some measurable
nonnegative function $p$, $\mathbb{P}(T_{1,0} \le t) = \int_0^t p(u)\,du$. Note that $p$ is merely the density
of $T_{1,0}$ if $T_{1,0} < +\infty$ a.s., which we shall not assume. Our aim is to estimate the
transition intensity $s$ of $T_{1,0}$, which is defined for $t > 0$ by $s(t) = p(t)/\mathbb{P}(X_{t-} = 1)$.

For pedagogical reasons mainly, since it has already been extensively studied and 
can therefore serve as a reference, it will be interesting to consider also the much 
more classical 

Problem 0: Density estimation. It is the problem of estimating an unknown
density $s$ from $n$ i.i.d. observations $X_1, \ldots, X_n$ with this density.

All the problems described in the above examples amount to estimating a function
$s$ mapping $\mathcal{X}$ to $\mathbb{R}_+$. For this purpose, we choose a family $\mathcal{M}$ of partitions of $\mathcal{X}$ and
for each $m \in \mathcal{M}$ we design a nonnegative estimator $\hat{s}_m$ of $s$ which is constant on
the elements of this partition. We shall call such an estimator a histogram-type
estimator. The performance of $\hat{s}_m$ depends on both $s$ and $m$. Since $s$ is unknown, we
cannot pick the partition which leads to the best estimator. To select a partition in
$\mathcal{M}$, we shall rather use a method solely based on our data, leading to some random
partition $\hat{m}$, and define our resulting estimator as $\hat{s}_{\hat{m}}$. Our objective is to design
the selection procedure in such a way that $\hat{s}_{\hat{m}}$ performs almost as well as the best
estimator among the family $\{\hat{s}_m, m \in \mathcal{M}\}$.

The purpose of this paper is to describe some general setup which allows us to
deal with all the five problems simultaneously, to explain the construction of our
histogram-type estimators $\hat{s}_m$, to design a suitable selection procedure $\hat{m}$ and to
study the performance of the resulting estimator $\hat{s}_{\hat{m}}$. We shall illustrate our results
by numerous examples of families of partitions and target functions $s$ of interest.
For the problems of estimating the intensity of a Poisson process or a hazard rate
on the line, our method provides estimators that can cope with different families
of functions simultaneously, including monotone, Hölderian, or piecewise constant
with a few jumps with unknown locations and sizes. In the multivariate case, we
shall also provide some special method for estimating Poisson intensities with a few
spikes with unknown locations and heights.

The problem of estimating $s$ by model selection in the first four setups described
above did not receive much attention in the literature, with a few notable exceptions.
Problem 1 is generally viewed as a regression problem where the mean $s_i$ takes
the form $f(x_i)$ for some design points $x_i$ (typically $f$ is defined on $[0, 1]$ and $x_i = i/n$).
To perform model selection, one introduces a wavelet basis and performs a shrinkage
of the estimated coefficients of $f$ with respect to this basis. This amounts to selecting
which coefficients will be kept. To this form of selection pertain the papers by
Antoniadis, Besbeas and Sapatinas (2001), Antoniadis and Sapatinas (2001). Closer
to our approach is Kolaczyk and Nowak (2004), based on penalized maximum
likelihood. Unlike ours, their approach requires that the means $s_i$ be uniformly bounded
from above and below by known positive constants. For Problem 2, a similar
approach based on wavelet shrinkage is developed in Kolaczyk (1999), but the reference
result is Reynaud-Bouret (2003). Problems 3 and 4 amount to estimating Aalen's
multiplicative intensity $s$ of some counting process with a bounded number of jumps.
The problem of non-parametric estimation of Aalen's multiplicative intensities has
been considered by Antoniadis (1989), who uses penalized maximum likelihood
estimation with a roughness penalty and gets uniform rates of convergence over Sobolev
balls. Van de Geer (1995) considers the Hellinger loss and establishes uniform
estimation rates for the maximum likelihood estimator over classes of intensities with
controlled bracketing entropy. Grégoire and Nembé (2000) extend the results of
Barron and Cover (1991) about density estimation to that of intensities. Wu and
Wells (2003) and Patil and Wood (2004) derive asymptotic results for thresholding
estimators based on wavelet expansions. All these results, apart from those of van
de Geer, are of an asymptotic nature. Reynaud-Bouret (2002) introduces a model
selection procedure to estimate the intensity. A common feature of these papers lies
in the use of martingale techniques (apart from Grégoire and Nembé, 2000). Unlike
theirs, our approach does not require any martingale argument at all.

In Section 2 we present a general statistical framework which allows us to handle
simultaneously all the examples we have mentioned. We also review some
special classes of target functions and the various families of models (partitions) to
be used in our estimation procedure. The treatment of our five estimation problems
is provided in the sections that follow. The results presented there derive from a unifying
theorem stated later in the paper. The remainder of the paper is devoted to the
most technical proofs.

In the sequel, we shall make systematic use of the following notations: constants
will be denoted by $C, C', c, \ldots$ and may change from line to line; we denote by $\mathbb{N}^*$
the set of positive integers and we write $x \wedge y$ for $\min\{x, y\}$, $x \vee y$ for $\max\{x, y\}$ and
$|m|$ for the cardinality of a set $m$.

2. Presentation of our method 

2.1. A general statistical framework. We consider an abstract probability space
$(\Omega, \mathcal{E}, \mathbb{P})$ and a measurable space $(\mathcal{X}, \mathcal{A})$ bearing a nonnegative $\sigma$-finite measure $\lambda$. In
the sequel $\mathbb{E}$ will denote the expectation with respect to $\mathbb{P}$. We then consider on $\mathcal{X}$ a
nonnegative bounded random process $Y = Y(x, \omega)$, i.e. a measurable function from
$\mathcal{X} \times \Omega$ to $\mathbb{R}_+$, and the nonnegative random measure $M$ on $\mathcal{X}$ given by $dM = Y\,d\lambda$.
Besides $M$, we also observe a nonnegative random measure $N$ on $\mathcal{X}$ which satisfies

\[
\mathbb{E}[N(A)] = \mathbb{E}\left[\int_A s\,dM\right] < +\infty \quad \text{for all } A \in \mathcal{A}, \tag{1}
\]

for some deterministic nonnegative and measurable function $s$ on $\mathcal{X}$. Note that this
assumption implies that $N$ is a.s. a finite measure. Our aim is to estimate $s$ from the
observations $N$ and $M$. Hereafter, we shall deal with estimators that belong to the
cone $\mathcal{L}$ of nonnegative measurable functions $t$ on $\mathcal{X} \times \Omega$ such that $\mathbb{E}\left[\int_{\mathcal{X}} t\,dM\right] < +\infty$.
Note that $s$ also belongs to $\mathcal{L}$. To measure the risks of such estimators, we endow $\mathcal{L}$
with the quasi-distance $H$ (since we may have $H(t, t') = 0$ with $t \ne t'$) defined between two
elements $t$ and $t'$ of $\mathcal{L}$ by

\[
H^2(t, t') = \int_{\mathcal{X}} \left(\sqrt{t} - \sqrt{t'}\right)^2 dM,
\]

and set as usual, for $t \in \mathcal{L}$ and $\mathcal{F} \subset \mathcal{L}$, $H(t, \mathcal{F}) = \inf_{f \in \mathcal{F}} H(t, f)$. Given an estimator
$\hat{s}$ of $s$, i.e. a measurable function of $N$ and $Y$ with $\hat{s} \in \mathcal{L}$, we define its risk by
$\mathbb{E}\left[H^2(\hat{s}, s)\right]$. In most of our applications, $Y$ is identically equal to 1, in which case
$M = \lambda$ is deterministic and, if $t$ and $t'$ are densities with respect to $M$, $H$ is merely
the Hellinger distance between the corresponding probabilities. Only the cases of
Problems 3 and 4 require handling random measures $M$.

In order to define our estimators we assume that

\[
\mathbb{P}[N(A) > 0 \text{ and } M(A) = 0] = 0 \quad \text{for all } A \in \mathcal{A}, \tag{2}
\]

a property which is automatically fulfilled when $M = \lambda$ is deterministic because
of (1).

2.2. Histogram-type estimators. Let us now introduce the histogram-type
estimators $\hat{s}_m$ based on some finite partition $m$ of $\mathcal{X}$. We consider the subset
$\mathcal{J} = \{A \in \mathcal{A} \mid \mathbb{E}[M(A)] < +\infty\}$ of $\mathcal{A}$ and define the model $S_m$ as the set of (possibly random)
nonnegative piecewise constant functions on $\mathcal{X}$:

\[
S_m = \left\{ t = \sum_{I \in m \cap \mathcal{J}} t_I 1_I, \ t_I = t_I(\omega) \in \mathbb{R}_+ \text{ for all } I \in m, \ \omega \in \Omega \right\} \cap \mathcal{L}.
\]

We then define the histogram estimator $\hat{s}_m$ as the element of $S_m$ given (with the
convention $0/0 = 0$) by

\[
\hat{s}_m = \sum_{I \in m \cap \mathcal{J}} \frac{N(I)}{M(I)}\, 1_I.
\]

Note that $\hat{s}_m$ is a.s. well-defined because of (2). We shall, hereafter, call it the
histogram estimator based on $m$.
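
As an illustration of the definition above, here is a minimal sketch for the simplest case where the space is $[0, 1)$, $M = \lambda$ is the Lebesgue measure and the partition consists of intervals $[x_{k-1}, x_k)$. The function name `histogram_estimator` and the encoding of a partition by its list of breakpoints are assumptions made for this example only.

```python
from bisect import bisect_right

def histogram_estimator(points, breakpoints):
    """Values N(I) / lambda(I) on the intervals I = [x_{k-1}, x_k).

    points      -- observed points of the process, all in [0, 1)
    breakpoints -- increasing list 0 = x_0 < x_1 < ... < x_D = 1
    """
    counts = [0] * (len(breakpoints) - 1)
    for x in points:
        # Index of the left-closed interval containing x.
        k = bisect_right(breakpoints, x) - 1
        counts[k] += 1
    # Divide each count N(I) by the Lebesgue measure of I.
    return [c / (b - a) for c, a, b in zip(counts, breakpoints, breakpoints[1:])]

# Three points fall in [0, 0.25) and one in [0.25, 1).
values = histogram_estimator([0.05, 0.10, 0.20, 0.70], [0.0, 0.25, 1.0])
```

With a random measure $M$ (Problems 3 and 4) the denominator would be $M(I)$ instead of the interval length, which this sketch does not cover.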

Under suitable assumptions that will be satisfied for Problems 0, 1 and 2 (the
case of hazard rates and Markov processes being more complicated), we shall prove
for $\hat{s}_m$ a risk bound of the form

\[
\mathbb{E}\left[H^2(\hat{s}_m, s)\right] \le C_0 \left\{ \mathbb{E}\left[H^2(s, S_m)\right] + C_P |m| \right\}, \tag{3}
\]

where $C_0$ is a numerical constant and $C_P$ depends on the problem we consider. For
instance, $C_P = n^{-1}$ for density estimation and $C_P = 1$ for estimating the intensity
of a Poisson process. We recover here the usual decomposition of the risk bounds
into an approximation term which involves the distance of the parameter from the
model and a complexity term proportional to the number $|m|$ of parameters that
describe the model.

2.3. The selection procedure. Given the family of models $\{S_m, m \in \mathcal{M}\}$
corresponding to a finite or countable family $\mathcal{M}$ of partitions $m$, we consider, in order to
define our model selection procedure, the possibly enlarged family

\[
\overline{\mathcal{M}} = \{m \vee m' \text{ for } m, m' \in \mathcal{M}\}; \qquad m \vee m' = \{I \cap I' \mid I \in m, \ I' \in m', \ I \cap I' \ne \emptyset\},
\]

so that $m \vee m'$ is again a finite partition of $\mathcal{X}$.

We shall systematically make the following assumption about the family $\mathcal{M}$.

H: There exists some $\delta \ge 1$ such that $|m \vee m'| \le \delta\,(|m| + |m'|)$ for all $(m, m') \in \mathcal{M}^2$.

We then introduce a penalty function "pen" from $\mathcal{M}$ to $\mathbb{R}_+$, to be described below,
and, for $m \ne m' \in \mathcal{M}$, we consider the test statistic

\[
T_{m,m'}(N) = H^2(\hat{s}_m, \hat{s}_{m \vee m'}) - H^2(\hat{s}_{m'}, \hat{s}_{m \vee m'}) + 16\,[\operatorname{pen}(m) - \operatorname{pen}(m')]. \tag{4}
\]

The corresponding test between $m$ and $m'$ decides $m$ if $T_{m,m'} < 0$, $m'$ if $T_{m,m'} > 0$
and at random if $T_{m,m'} = 0$. Note that the tests corresponding to $T_{m,m'}$ and $T_{m',m}$
are the same. We then set, for all $m \in \mathcal{M}$,

\[
\mathcal{R}_m = \{m' \in \mathcal{M}, \ m' \ne m \mid \text{the test based on } T_{m,m'} \text{ rejects } m\}
\]

and, given some $\varepsilon > 0$, we define $\hat{m}$ to be any point in $\mathcal{M}$ such that

\[
\mathcal{D}(\hat{m}) \le \inf_{m \in \mathcal{M}} \mathcal{D}(m) + \varepsilon/3 \quad \text{with} \quad \mathcal{D}(m) = \sup_{m' \in \mathcal{R}_m} \left\{ H^2(\hat{s}_m, \hat{s}_{m'}) \right\}. \tag{5}
\]

This model selection procedure results in an estimator $\tilde{s} = \hat{s}_{\hat{m}}$ that we shall call
penalized histogram estimator (in the sequel PHE, for short) based on the family of
models $\{S_m, m \in \mathcal{M}\}$ and the penalty function $\operatorname{pen}(\cdot)$. As to the penalty, it is the
sum of two components: $\operatorname{pen}(m) = c_1 |m| + c_2 \Delta_m$, with $c_1$ and $c_2$ depending on the
framework and $\Delta_m$ being a nonnegative weight associated to the model $S_m$. We
require that those weights satisfy

\[
\sum_{m \in \mathcal{M}} \exp[-\Delta_m] = \Sigma < +\infty. \tag{6}
\]

If $\Sigma = 1$, the choice of the $\Delta_m$ can be viewed as the choice of a prior distribution
on the models. For related conditions and their interpretation, see Barron and
Cover (1991), Barron, Birgé and Massart (1999) or Birgé and Massart (2001). The
constant 16 in (4) plays no particular role and has only been chosen in order to
improve the legibility of our main results. Our selection procedure can be viewed
as a mixture between a method due to Birgé (1983 and 2006) based on testing and
an improved version of the original Lepski's method, as described in Lepski (1991)
and subsequent work of the same author. This improved version was presented by
Lepski in a series of lectures he gave at Garchy in 1998.
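
The selection rule of this section can be sketched in code for interval partitions of $[0, 1)$ with $M$ equal to the Lebesgue measure. This is an illustration under stated assumptions, not the authors' implementation: the names `hist`, `join`, `h2` and `select` are hypothetical, $H^2$ is computed as the integral of $(\sqrt{t} - \sqrt{t'})^2$ with no extra normalising constant, a tie $T_{m,m'} = 0$ is treated as not rejecting $m$ (instead of deciding at random), the supremum over an empty set of rejecting partitions is taken to be 0, and the family is finite so that the infimum in (5) is attained.

```python
from bisect import bisect_right
from math import sqrt

def hist(points, bp):
    """Histogram values N(I)/lambda(I) on the intervals [bp[k], bp[k+1])."""
    counts = [0] * (len(bp) - 1)
    for x in points:
        counts[bisect_right(bp, x) - 1] += 1
    return [c / (b - a) for c, a, b in zip(counts, bp, bp[1:])]

def value_at(bp, vals, x):
    """Value at x of the piecewise constant function encoded by (bp, vals)."""
    return vals[bisect_right(bp, x) - 1]

def join(bp1, bp2):
    """Breakpoints of m v m' for two interval partitions of [0, 1)."""
    return sorted(set(bp1) | set(bp2))

def h2(bp1, v1, bp2, v2):
    """Integral over [0, 1) of (sqrt(t) - sqrt(t'))^2 (normalisation assumed)."""
    grid = join(bp1, bp2)
    return sum((sqrt(value_at(bp1, v1, a)) - sqrt(value_at(bp2, v2, a))) ** 2 * (b - a)
               for a, b in zip(grid, grid[1:]))

def select(points, family, pen):
    """Index in `family` of a partition minimising the criterion D of (5)."""
    est = [hist(points, bp) for bp in family]

    def T(i, j):
        # Test statistic (4) between partitions i and j.
        bpj = join(family[i], family[j])
        vj = hist(points, bpj)
        return (h2(family[i], est[i], bpj, vj) - h2(family[j], est[j], bpj, vj)
                + 16 * (pen[i] - pen[j]))

    def D(i):
        # Largest distance to an estimator whose test rejects partition i.
        rejecting = [j for j in range(len(family)) if j != i and T(i, j) > 0]
        return max((h2(family[i], est[i], family[j], est[j]) for j in rejecting),
                   default=0.0)

    return min(range(len(family)), key=D)

pts = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4]
fam = [[0.0, 1.0], [0.0, 0.5, 1.0]]
```

In this toy example all points lie in $[0, 0.5)$: with a small penalty the two-cell partition is selected, while a large penalty on $|m|$ makes the trivial partition win.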

2.4. Risk bounds for the procedure. As we shall see later, with a suitable choice
of $\varepsilon$, the performances of this procedure for Problems 0, 1 and 2 are described by
risk bounds of the following form:

\[
\mathbb{E}\left[H^2(\tilde{s}, s)\right] \le C \inf_{m \in \mathcal{M}} \left\{ \mathbb{E}\left[H^2(s, S_m)\right] + C_P |m| \left[1 + |m|^{-1}\left(\Delta_m + \Sigma^2\right)\right] \right\}, \tag{7}
\]

where $C$ is a numerical constant and $C_P$ is as in (3). Comparing (7) with (3), we see
that the estimator $\tilde{s}$ achieves a risk bound comparable, up to a constant factor, with
the best risk bound obtained by the estimators $\hat{s}_m$ provided that $\Sigma$ is not large and
$\Delta_m$ not much larger than $|m|$. Note that these two restrictions are, to some extent,
contradictory since the smaller $\Delta_m$, the larger $\Sigma$, although it is clearly unnecessary
to choose $\Delta_m$ smaller than $|m|$. Therefore, if $\sum_{m \in \mathcal{M}} e^{-|m|}$ is not large, one can
merely take $\Delta_m = |m|$. Otherwise, the choice of the $\Delta_m$ will be more delicate but
we should keep in mind that, if $\Sigma$ is not large, the performance of $\tilde{s}$ will be as good
(up to a constant factor) as the performance of any $\hat{s}_m$ for which $\Delta_m \le |m|$.



3.1. Some classes of functions of special interest. The motivations for the
choice of some family of models $\{S_m, m \in \mathcal{M}\}$ are twofold. First, there is the
restriction that $\mathcal{M}$ should satisfy Assumption H and there are two main examples
of such families. In the "nested" case, the family is totally ordered for the inclusion
and thus, we either have $m \vee m' = m$ or $m \vee m' = m'$ for all $m$ and $m'$ in $\mathcal{M}$. Then,
$\overline{\mathcal{M}} = \mathcal{M}$ and $\delta = 1$. Another situation where Assumption H is satisfied with $\delta = 1$
occurs when $\mathcal{X}$ is either $\mathbb{R}$ or some subinterval of $\mathbb{R}$ and each $m \in \mathcal{M}$ is a finite
partition of $\mathcal{X}$ into intervals.

The second motivation is connected to the approximation properties of the models.
If, for instance, we believe that the true $s$ is smooth or monotone, one should
introduce families of models that approximate such functions reasonably well. In
the sequel, we shall put a special emphasis on the following classes of functions:



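• Monotone functions. For $\mathcal{X}$ an interval of $\mathbb{R}$ with interior $\mathring{\mathcal{X}}$ and $R$ a positive
number, we denote by $\mathcal{S}_1(R)$ the set of monotone functions $t$ on $\mathcal{X}$ such that
$\sup_{x, y \in \mathring{\mathcal{X}}} |t(x) - t(y)| \le R$.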



• Continuous functions. Let $w$ be a modulus of continuity on $[0, 1)$, i.e. a
continuous nondecreasing function with $w(0) = 0$ (see additional details in
DeVore and Lorentz (1993)). We denote by $\mathcal{S}_2(w)$ the set of functions $t$ on
$[0, 1)$ such that $|t(x + y) - t(x)| \le w(y)$ for all $x \in [0, 1)$ and $0 \le y < 1 - x$.
For $0 < \alpha \le 1$ and $R > 0$, the Hölder class is the class $\mathcal{S}_2(w)$ with
$w(y) = R y^{\alpha}$. More generally we say that a function $u$ defined on $V \subset [0, 1)^k$
for some $k \ge 1$ belongs to the Hölder class on $V$ with parameters $\alpha \in\, ]0, 1]$ and $R > 0$ if
o



k 




for all $x, y \in V$.
• Piecewise constant functions. If the function $t$ defined on $[0, 1)$ is constant
over some intervals and then jumps from time to time, it is a piecewise
constant function of the form

\[
t = \sum_{k=1}^{D} t_k 1_{[x_{k-1}, x_k)} \quad \text{with} \quad 0 = x_0 < x_1 < \ldots < x_D = 1. \tag{8}
\]

We shall denote by $\mathcal{S}_3(D, R)$ the class of such piecewise constant functions such that
$\sup_{1 \le k \le D} t_k \le R$. Note that this would correspond to a parametric model
with $D$ parameters if the locations of the jumps were known. We shall restrict
our attention to $D \ge 2$ since $\mathcal{S}_3(1, R)$ only contains constant functions and
is then a subset of $\mathcal{S}_2(w)$ with $w \equiv 0$.

• Besov balls and functions of bounded variation. Here we consider
functions $t$ defined on $[0, 1)$. Given positive numbers $\alpha$, $p$ and $R$, we denote
by $\mathcal{B}^{\alpha}_{p,\infty}(R)$ the closed Besov ball of radius $R$ centered at zero of the Besov
space $B^{\alpha}_{p,\infty}([0, 1))$, i.e. the set of functions $t$ in this space with Besov
semi-norm $|t|_{B^{\alpha}_{p,\infty}} \le R$. Analogously, we write $\mathcal{B}_{BV}(R)$ for the set of functions $t$
of bounded variation with $\mathrm{Var}^*(t) \le R$. We refer to Chapter 2 of the book
by DeVore and Lorentz (1993) for details on Besov spaces and the definition
of Besov semi-norms, functions of bounded variation and the variation
semi-norm $\mathrm{Var}^*$. Note that $\mathcal{S}_1(R) \subset \mathcal{B}_{BV}(R)$. We shall also consider the
multidimensional Besov spaces $B^{\alpha}_{p,\infty}([0, 1)^k)$ for $k \ge 2$.



3.2. Some typical models. Let us now describe a few useful families of models
and corresponding choices for the weights $\Delta_m$ that satisfy (6).



3.2.1. Example 1: models for functions on $[0, 1)$. The following models are suitable
for approximating functions belonging to the classes that we just mentioned. Since
they are based on partitions of $[0, 1)$ into intervals, they satisfy Assumption H with
$\delta = 1$. Let $J_l = \{j 2^{-l}, \ j \in \mathbb{N}\}$ and $J_\infty = \bigcup_{l \ge 1} J_l$ be the set of all dyadic points
in $[0, 1)$. To build $\mathcal{M}$, we consider partitions $m = \{I_1, \ldots, I_D\}$ of $[0, 1)$ generated
by increasing sequences $\{0 = x_0 < x_1 < \ldots < x_D = 1\}$ with $I_j = [x_{j-1}, x_j)$. We
then define $\mathcal{M}$ to be the set of all such partitions with $x_i \in J_\infty$ for $1 \le i \le D - 1$.
Therefore, whatever $m \in \mathcal{M}$, the elements of $S_m$ are piecewise constant functions
with $D$ pieces and jumps located on the grid $J_\infty$. The novelty of this particular
family of partitions lies in the fact that there is no lower bound on the length of
the intervals on which the partitions are built. It will be useful to single out the set
$\mathcal{M}_R = \{m_k, \ k \in \mathbb{N}\}$ of regular dyadic partitions, where $m_k$ is the partition of $[0, 1)$
into $2^k$ intervals of length $2^{-k}$. In particular, $m_0 = \{[0, 1)\}$.

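One possible way of defining the corresponding weights $\Delta_m$ is as follows. For
$l \in \mathbb{N}^*$ and $2 \le D \le 2^l$ we define $\mathcal{M}_{l,D}$ as the set of all partitions $m$ with
$|m| = D$ and such that $l$ is the smallest integer for which $\{x_1, \ldots, x_{D-1}\} \subset J_l$. Then,
$\mathcal{M} = \bigcup_{l \ge 1} \left( \bigcup_{D=2}^{2^l} \mathcal{M}_{l,D} \right) \cup \{m_0\}$. We choose $\Delta_{m_0} = 1$ and

\[
\Delta_m = D\,(l \log 2 + 2 - \log D) + 2 \log l \quad \text{if } m \in \mathcal{M}_{l,D}. \tag{9}
\]

Since $|\mathcal{M}_{l,D}| \le \binom{2^l - 1}{D - 1} \le \binom{2^l}{D} \le (2^l e / D)^D$, we derive from (9) that

\[
\sum_{m \in \mathcal{M} \setminus \{m_0\}} \exp[-\Delta_m] \le \sum_{l \ge 1} \sum_{D=2}^{2^l} |\mathcal{M}_{l,D}|\, l^{-2} \exp[-D\,(l \log 2 + 2 - \log D)] \le \cdots < 0.14
\]

and it follows that (6) is satisfied.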
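
To make the weights (9) concrete, the following sketch computes $\Delta_m$ for a partition of $[0, 1)$ given by its interior dyadic breakpoints and evaluates a partial sum of $\exp[-\Delta_m]$ by brute-force enumeration. The names `level`, `weight` and `partial_sum` are hypothetical, and breakpoints are encoded as exact fractions to avoid rounding; the enumeration is only feasible for small levels and is a numerical sanity check, not a proof of the bound.

```python
from fractions import Fraction
from itertools import combinations
from math import exp, log

def level(interior_points):
    """Smallest l >= 1 such that every interior breakpoint is a multiple of 2^-l."""
    l = 1
    while any((x * 2 ** l).denominator != 1 for x in interior_points):
        l += 1
    return l

def weight(interior_points):
    """Delta_m as in (9); the empty tuple encodes the trivial partition m_0."""
    if not interior_points:
        return 1.0
    D = len(interior_points) + 1          # number of intervals
    l = level(interior_points)
    return D * (l * log(2) + 2 - log(D)) + 2 * log(l)

def partial_sum(max_level):
    """Sum of exp(-Delta_m) over nontrivial partitions with breakpoints in J_max_level."""
    grid = [Fraction(j, 2 ** max_level) for j in range(1, 2 ** max_level)]
    return sum(exp(-weight(pts))
               for r in range(1, len(grid) + 1)
               for pts in combinations(grid, r))
```

For instance, the partition with the single breakpoint $1/2$ has $l = 1$ and $D = 2$, so that (9) gives $\Delta_m = 4$.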



3.2.2. Special partitions derived from adaptive approximation algorithms. It is easily
seen that the family $\mathcal{M}$ of partitions we introduced for Example 1 is too rich for
choosing $\Delta_m = c|m|$ for all $m$ and $c$ a fixed constant, since then (6) would not be
satisfied. For partitions in $\mathcal{M}_{l,D}$ with $l > D$, $\Delta_m$ behaves as $l|m|$ and $l$ can be
arbitrarily large. Fortunately, there exists a subset $\mathcal{M}_T$ of $\mathcal{M}$ which is of special
interest because of its approximation properties with respect to functions in Besov
spaces, and such that it is possible to choose $\Delta_m = 2|m|$ for $m \in \mathcal{M}_T$. This will
definitely improve the performances of the PHE for estimating functions in Besov
spaces. Let us now describe $\mathcal{M}_T$.

Among all partitions of $[0, 1)$ with dyadic endpoints, some, which are in
one-to-one correspondence with the family of complete binary trees, can be derived
by the following algorithm described in Section 3.3 of DeVore (1998). One starts
with the root of the tree, which corresponds to the interval $[0, 1)$, and decides to divide
it into two intervals of length 1/2 or not. We assume here that all intervals contain
their left endpoint but not the right one. If one does not divide, the algorithm
stops and the tree is reduced to its root. If one divides, one gets two intervals
corresponding to adding two sons to the root. Then one repeats the procedure with
each interval and so on. At each step, the terminal nodes of the tree correspond
to the intervals in the partition and one decides to divide any such interval into two
equal parts or not. Dividing means adding two sons to the corresponding terminal
node. The whole procedure stops at some stage, producing a complete binary tree
with $D$ terminal nodes and the corresponding partition of $[0, 1)$ into $D$ intervals.
This is the type of tree which comes out of an algorithm like CART, as described by
Breiman et al. (1984). Such constructions and the corresponding selection procedure
resulting from the CART algorithm have been studied by Gey and Nédélec (2005).
We denote by $\mathcal{M}_T$ the subset of $\mathcal{M}$ of all partitions that can be obtained in this
way. Note here that the set $\mathcal{M}_R$ of regular partitions is a subset of $\mathcal{M}_T$.
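
The recursive splitting described above can be sketched as follows. The function `tree_partitions` (a hypothetical name) enumerates every partition of $[0, 1)$ produced by at most `max_depth` successive rounds of dyadic splits; the depth cap is an assumption added to keep the enumeration finite. As a sanity check, `catalan` gives the classical count of complete binary trees with $j + 1$ terminal nodes, which the enumeration reproduces for small trees.

```python
from math import comb

def tree_partitions(max_depth, a=0.0, b=1.0):
    """All partitions of [a, b) obtained by at most max_depth rounds of dyadic splits.

    Each partition is a tuple of (left, right) pairs encoding intervals [left, right).
    """
    result = [((a, b),)]                       # option 1: do not divide [a, b)
    if max_depth > 0:
        mid = (a + b) / 2
        # option 2: divide, then treat both halves independently
        for left in tree_partitions(max_depth - 1, a, mid):
            for right in tree_partitions(max_depth - 1, mid, b):
                result.append(left + right)
    return result

def catalan(j):
    """Number of complete binary trees with j + 1 terminal nodes."""
    return comb(2 * j, j) // (j + 1)

parts = tree_partitions(3)
```

Every complete binary tree with at most four terminal nodes has depth at most three, so `parts` contains all tree partitions with $D \le 4$ intervals.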

It is known that the number of complete binary trees with $j + 1$ terminal nodes is
given by the so-called Catalan numbers $(j + 1)^{-1} \binom{2j}{j}$, as explained for instance in
Stanley (1999, page 172). As a consequence, we can redefine $\Delta_m = 2|m|$ for $m \in \mathcal{M}_T$
and, using the fact (which derives from Stirling's expansion) that $\binom{2j}{j} \le 4^j$, get

\[
\sum_{m \in \mathcal{M}_T} \exp[-\Delta_m] \le \sum_{j \ge 0} \ \sum_{\{m \in \mathcal{M}_T \mid |m| = 1 + j\}} \exp[-2(j + 1)]
= \sum_{j \ge 0} \frac{1}{j + 1} \binom{2j}{j} \exp[-2(j + 1)]
\le e^{-2} \sum_{j \ge 0} \frac{(2/e)^{2j}}{j + 1} = \Sigma'_1.
\]

Finally (6) is satisfied with $\Sigma \le \Sigma'_1 + 0.14$.
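
For orientation, the series $e^{-2} \sum_{j \ge 0} (2/e)^{2j}/(j+1)$ appearing in the bound above can be evaluated in closed form. The computation below is an added worked step, not part of the original argument; it uses only the elementary identity $\sum_{j \ge 0} x^j/(j+1) = -\log(1-x)/x$ for $0 < x < 1$, applied with $x = 4e^{-2}$.

```latex
\[
e^{-2} \sum_{j \ge 0} \frac{(4e^{-2})^{j}}{j+1}
  = e^{-2} \cdot \frac{-\log\left(1 - 4e^{-2}\right)}{4e^{-2}}
  = -\frac{1}{4} \log\left(1 - 4e^{-2}\right)
  \approx 0.195 .
\]
```

Combined with the constant 0.14 obtained for the remaining partitions, this suggests that the total weight in (6) stays below 0.34 for this family.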

3.2.3. Example 2: estimating functions with radial symmetry. There are situations
where one may assume that the value of $s(x)$ only depends on the Euclidean distance
$\|x\|$ between this point and some origin, in which case one can write $s(x) = \phi(\|x\|)$.
In such a case, it is natural to estimate $s$ on a ball, which we may assume, without
loss of generality, to be the open unit ball $B_k$ of $\mathbb{R}^k$. To any partition $m$ of $[0, 1)$
we can associate a partition of $B_k$ with elements $J = \{x \mid \|x\| \in I\}$ where $I$ denotes
an element of $m$. For simplicity, we shall identify the two partitions (the first one
of $[0, 1)$ and the new one of $B_k$) and denote both of them by $m$. In the sequel, we
shall focus our attention on the family of partitions of Example 1 with the weights
defined in Section 3.2.2.

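3.2.4. Example 3: estimating functions on $[0, 1)^k$, $k \ge 2$. To deal with the case
$\mathcal{X} = [0, 1)^k$, let us first introduce some notations. For $j \in \mathbb{N}$ we consider the set

\[
\mathcal{N}_j = \left\{ l = (l_1, \ldots, l_k) \in \mathbb{N}^k \mid 1 \le l_i \le 2^j \text{ for } 1 \le i \le k \right\}
\]

and, for $j \in \mathbb{N}$ and $l \in \mathcal{N}_j$, the cube $K_{j,l}$ given by

\[
K_{j,l} = \left\{ x = (x_1, \ldots, x_k) \in [0, 1)^k \mid (l_i - 1) 2^{-j} \le x_i < l_i 2^{-j} \text{ for } 1 \le i \le k \right\}.
\]

We set $\mathcal{K}_j = \{K_{j,l}, \ l \in \mathcal{N}_j\}$ and $\mathcal{K} = \bigcup_{j \ge 0} \mathcal{K}_j$.

Let $\mathcal{P}$ be the collection of all finite subsets $p$ of $\mathcal{K} \setminus \mathcal{K}_0$ consisting of disjoint cubes.
To each $p \in \mathcal{P}$, we associate the positive quantity $J(p) = \inf\{j \mid p \cap \mathcal{K}_j \ne \emptyset\}$ ($J(\emptyset) =
+\infty$) and the partition $m_p$ generated by $p$, i.e. $m_p = \{I \in p\} \cup \{[0, 1)^k \setminus \bigcup_{I \in p} I\}$
provided that this last set is not empty and $m_p = \{I \in p\}$ otherwise. We finally set
$\mathcal{M} = \{m_p \vee \mathcal{K}_j \text{ with } p \in \mathcal{P} \text{ and } j < J(p)\}$. Note here that the mapping $(j, p) \mapsto
m_p \vee \mathcal{K}_j$ is not one to one. For instance $m_\emptyset \vee \mathcal{K}_j = \mathcal{K}_j = \mathcal{K}_j \vee \mathcal{K}_{j-1}$. We shall prove
in Section 7.1 the following result:

Lemma 1. The family $\mathcal{M}$ satisfies Assumption H with $\delta = 2$.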

In order to define the weights $\Delta_m$, we shall distinguish a special subset $\mathcal{M}_T^k$ of $\mathcal{M}$
which is the $k$-dimensional analogue of the one we considered in Section 3.2.2. Here
one starts the algorithm with $\mathcal{X} = [0, 1)^k$ (which corresponds to the root of the tree)
and at each step gets a partition of $\mathcal{X}$ into a finite family of disjoint cubes of the form
$K_{j,l}$. One then decides to divide any such cube into the $2^k$ elements of $\mathcal{K}_{j+1}$ which
are contained in it or not. Again, this corresponds to growing a complete $2^k$-ary tree,
partitioning a cube meaning adding $2^k$ sons to a terminal node, and the set $\mathcal{M}_T^k$ of
all partitions that can be constructed in this way corresponds to the set of complete
$2^k$-ary trees. As for $k = 1$, $\mathcal{M}_T^k$ contains the set $\mathcal{M}_R = \{m_\emptyset \vee \mathcal{K}_j, \ j \ge 0\}$ of all
regular partitions of $\mathcal{X}$ into $2^{kj}$ cubes of equal volume. Working with $\mathcal{M}$ instead of
the much simpler family $\mathcal{M}_R$ allows us to handle less regular functions like those which
have a few spikes or are less smooth on some subset of $\mathcal{X}$.

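If $m \in \mathcal{M}_T^k$ we take $\Delta_m = |m|$ and otherwise we set

\[
\Delta'_{j,p} = j + k \sum_{i \ge 1} (j + i)\, |p \cap \mathcal{K}_{j+i}| \quad \text{for } p \in \mathcal{P} \text{ and } j < J(p)
\]

and

\[
\Delta_m = \inf_{\{(j,p) \mid m = m_p \vee \mathcal{K}_j\}} \Delta'_{j,p} \quad \text{for } m \in \mathcal{M} \setminus \mathcal{M}_T^k. \tag{10}
\]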
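
As a small sketch of how the quantity entering (10) can be evaluated for one representation $m = m_p \vee \mathcal{K}_j$: only the levels of the cubes in $p$ matter, since a cube belonging to $\mathcal{K}_{j+i}$ contributes $k(j+i)$. The function name `delta_prime` and the encoding of $p$ by the list of its cube levels are assumptions made for this example; the infimum over all representations of $m$ in (10) is not computed here.

```python
from collections import Counter

def delta_prime(j, cube_levels, k):
    """Weight j + k * sum over i >= 1 of (j + i) * |p ∩ K_{j+i}|.

    j           -- level of the regular grid K_j in the representation
    cube_levels -- level of each cube of p (a cube of K_r has side 2^-r)
    k           -- dimension of the cube [0, 1)^k
    """
    if any(r <= j for r in cube_levels):
        raise ValueError("every cube of p must have level > j, i.e. j < J(p)")
    counts = Counter(cube_levels)              # counts[r] = |p ∩ K_r|
    return j + k * sum(r * n for r, n in counts.items())
```

For example, `delta_prime(0, [5], k=2)` returns 10: a single cube of level 5 combined with the trivial grid gives a partition with two elements but a weight equal to $k$ times the level.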

Note that the ratio $\Delta_m / |m|$ is unbounded for $m \notin \mathcal{M}_T^k$, as shown by the example of
$m = m_p \vee \mathcal{K}_0$ with $p$ reduced to a single element of $\mathcal{K}_j$, $j > 0$. Then $|m| = 2$ while
$\Delta_m = kj$ may be arbitrarily large. For the partitions $m$ belonging to $\mathcal{M}_T^k$ we use the
fact (see Stanley (1999)) that any complete $l$-ary tree has a number of terminal
nodes of the form $1 + j(l - 1)$ for some $j \in \mathbb{N}$ and that the number of such trees with
$1 + j(l - 1)$ terminal nodes is $[1 + j(l - 1)]^{-1} \binom{lj}{j}$. For $l = 2^k$ we derive that the
number of partitions in $\mathcal{M}_T^k$ with $1 + j(2^k - 1)$ elements is $[1 + j(2^k - 1)]^{-1} \binom{2^k j}{j}$.
Moreover, since $k \ge 2$, we check that

\[
\Delta_m \ge j\,(k \log 2 + 1) + \log(j + 1) \quad \text{if } |m| = 1 + j(2^k - 1).
\]

Since $\binom{lj}{j} \le (le)^j$, it follows that

\[
\sum_{m \in \mathcal{M}_T^k} \exp[-\Delta_m]
\le \sum_{j \ge 0} \ \sum_{\{m \in \mathcal{M}_T^k \mid |m| = 1 + j(2^k - 1)\}} \frac{\exp[-j\,(k \log 2 + 1)]}{j + 1}
\le \sum_{j \ge 0} \frac{(2^k e)^j \exp[-j\,(k \log 2 + 1)]}{(j + 1)\,[1 + j(2^k - 1)]}
= \sum_{j \ge 0} \frac{1}{(j + 1)\,[1 + j(2^k - 1)]} = \Sigma'_k.
\]
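Let us now turn to the partitions of the form $m_p \vee \mathcal{K}_j$. For such a partition, $p \cap \mathcal{K}_{j'} = \emptyset$
for $j' \le j$ and, for $i \ge 1$, $|p \cap \mathcal{K}_{j+i}| = l_i$ with $0 \le l_i \le 2^{k(j+i)}$. Moreover, the number
of those $p \in \mathcal{P}$ such that $|p \cap \mathcal{K}_{j+i}| = l_i$ for a given sequence $l = (l_i)_{i \ge 1}$ with a finite
number of nonzero coefficients is bounded by $\prod_{i \ge 1} \binom{2^{k(j+i)}}{l_i}$. It follows from (10)
that

\[
\begin{aligned}
\sum_{m \in \mathcal{M} \setminus \mathcal{M}_T^k} \exp[-\Delta_m]
&\le \sum_{j \ge 0} \ \sum_{\{p \in \mathcal{P} \mid J(p) > j\}} e^{-j} \prod_{i \ge 1} e^{-k(j+i)\,|p \cap \mathcal{K}_{j+i}|} \\
&\le \sum_{j \ge 0} e^{-j} \sum_{l} \ \sum_{\{p \mid |p \cap \mathcal{K}_{j+i}| = l_i \text{ for } i \ge 1\}} \ \prod_{i \ge 1} e^{-k(j+i)\, l_i} \\
&\le \sum_{j \ge 0} e^{-j} \sum_{l} \ \prod_{i \ge 1} \binom{2^{k(j+i)}}{l_i} e^{-k(j+i)\, l_i} \\
&\le \sum_{j \ge 0} e^{-j} \prod_{i \ge 1} \ \sum_{l_i = 0}^{2^{k(j+i)}} \binom{2^{k(j+i)}}{l_i} e^{-k(j+i)\, l_i} \\
&= \sum_{j \ge 0} \exp\left[ -j + \sum_{i \ge 1} 2^{k(j+i)} \log\left(1 + e^{-k(j+i)}\right) \right] \\
&\le \sum_{j \ge 0} \exp\left[ -j + \sum_{i \ge 1} 2^{k(j+i)} e^{-k(j+i)} \right] = \Sigma''_k < +\infty.
\end{aligned}
\]

Finally we can conclude that (6) holds with $\Sigma \le \Sigma'_k + \Sigma''_k$.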



3.2.5. Models for $n$-dimensional vectors. To handle the problem we started with
in the introduction, we may assume that our finite index set is actually $\mathcal{X} =
\{1, \ldots, n\}$, the estimation of the function $s$ from $\mathcal{X}$ to $\mathbb{R}_+$ amounting to the
estimation of the vector $(s_1, \ldots, s_n)^* \in \mathbb{R}_+^n$ with coordinates $s_i = s(i)$.



Example 4. If one assumes that either $s_i$ varies smoothly with $i$ or is monotone or
piecewise constant with a small number of jumps, it is natural to choose for $m$ a
partition of $\mathcal{X}$ into intervals and for $\mathcal{M}$ the set of all such partitions. Note that this
family satisfies Assumption H with $\delta = 1$. Setting here $\Delta_m = |m| + \log \binom{n-1}{|m|-1}$,
we get (6) with $\Sigma \le (e - 1)^{-1}$ since there are $\binom{n-1}{D-1}$ partitions in $\mathcal{M}$ with $D$
elements for $1 \le D \le n$.



Example 5. An alternative case is the case when $s$ is constant, equal to $\bar{s}$ on $\mathcal{X}$,
except for a small number of locations $i$ where $s(i) \ne \bar{s}$. Since the number $k$ of such
locations is unknown, it is natural, for each $k \in \{0, \ldots, n - 1\}$, to define $\mathcal{M}_k$ as the
set of partitions of $\mathcal{X}$ with $k$ singletons and the set of the $n - k$ remaining points.
We finally set $\mathcal{M} = \bigcup_{0 \le k \le n-1} \mathcal{M}_k$. Then Assumption H holds with $\delta = 1$. For
$m \in \mathcal{M}_k$, $|m| = k + 1$ and we set $\Delta_m = \log \binom{n}{k} + k = \log \binom{n}{|m|-1} + |m| - 1$, so
that (6) holds with $\Sigma \le e/(e - 1)$.
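
The two bounds on the total weight in Examples 4 and 5 can be checked numerically. In the sketch below (hypothetical function names), each sum groups partitions by cardinality and uses the counts stated in the text: $\binom{n-1}{D-1}$ interval partitions with $D$ elements in Example 4, and at most $\binom{n}{k}$ partitions with $k$ singletons in Example 5, so that the second function returns an upper bound on the sum in (6).

```python
from math import comb, e, exp, log

def sum_example4(n):
    """Sum of exp(-Delta_m) with Delta_m = |m| + log C(n-1, |m|-1), grouped by D = |m|."""
    return sum(comb(n - 1, D - 1) * exp(-(D + log(comb(n - 1, D - 1))))
               for D in range(1, n + 1))

def sum_example5(n):
    """Upper bound on the sum of exp(-Delta_m) with Delta_m = log C(n, k) + k."""
    return sum(comb(n, k) * exp(-(log(comb(n, k)) + k)) for k in range(n))
```

In both cases the binomial count cancels against the logarithmic term of the weight, leaving a geometric series in $e^{-1}$, which is why the bounds $(e-1)^{-1}$ and $e/(e-1)$ do not depend on $n$.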
4. The case of a deterministic measure M



Let us now see how our general framework applies to Problems 1 and 2. Besides
these, our setup also covers the problem of density estimation. Although there is
a huge amount of literature on density estimation, our method brings some
improvements to known results on partition selection for histograms. Moreover, since
this problem has attracted so much attention, it can serve as a pedagogical
example and reference for the sequel. This is why, before considering more original and
less studied frameworks, we shall start our review with this quite familiar estimation
problem.



4.1. Density estimation. We consider the classical problem of estimating an
unknown density $s$ from a sample of size $n$, which means that we have at hand an
i.i.d. sample $X_1, \ldots, X_n$ from a distribution with unknown density $s$ with respect
to some given measure $M = \lambda$ on $\mathcal{X}$. We define $N$ to be the empirical distribution:
$N(A) = n^{-1} \sum_{i=1}^{n} 1_{X_i \in A}$. Then, as required, $\mathbb{E}[N(A)] = \int_A s\,d\lambda$ for all measurable
subsets $A$ of $\mathcal{X}$. In this case the distance $H$ is merely a version of the Hellinger
distance between densities.

Within this framework, we can prove the following general result. 

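Theorem 1. Assume that the family $\mathcal{M}$ satisfies Assumption H and the weights $\{\Delta_m, m\in\mathcal{M}\}$ are chosen so that (2) holds. Then the penalized histogram estimator $\hat{s} = \hat{s}_{\hat{m}}$ defined in Section 2.3 with $\mathrm{pen}(m) \ge n^{-1}(8\delta|m| + 202\Delta_m)$ satisfies

$\mathbb{E}\left[H^2(s,\hat{s})\right] \le 390\left[\inf_{m\in\mathcal{M}}\left(H^2(s,S_m) + \mathrm{pen}(m)\right) + \frac{101\Sigma^2}{n}\right] + \varepsilon.$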



The only previous works on partition selection for histograms using squared Hellinger loss we know about are to be found in Castellan (1999 and 2000) and Birgé (2006). Castellan's approach is based on penalized maximum likelihood. This requires specific restrictions on the underlying density s, in particular that s should be bounded away from 0. For the problem of estimating a density on R, her conditions on the family of partitions are also more restrictive than ours since we can handle any countable family of finite partitions into intervals. Nevertheless, in the multivariate case, our assumptions on the partitions are more stringent. Birgé's approach, based on aggregation of histograms built on one half of the sample, leads to more abstract but more general results.

Let us now apply the above theorem to various families of models, systematically setting $\mathrm{pen}(m) = n^{-1}(8\delta|m| + 202\Delta_m)$ and $\varepsilon = n^{-1}$. We assume in this section that $\lambda$ is the Lebesgue measure on X.



4.1.1. Example 1, continued. When X = [0,1), we use the family of models and weights of Section 3.2.1. Our next proposition shows that the PHE based on this simple family of models and weights has nice properties for estimating various types of functions. The proof will be given in Section 7.3.

Proposition 1. Let $\hat{s}$ be the PHE based on the family of models and weights $\Delta_m$ defined in Section 3.2.1, $\varepsilon = n^{-1}$ and the penalty function $\mathrm{pen}(m) = n^{-1}(8\delta|m| + 202\Delta_m)$.

i) If $\sqrt{s} \in S^1(R)$, then

(11) $\mathbb{E}\left[H^2(s,\hat{s})\right] \le C\left\{\left[Rn^{-1}\log\left(1+nR^2\right)\right]^{2/3} \vee n^{-1}\right\}.$

ii) If $\sqrt{s} \in S^2(w)$ where w is a modulus of continuity on [0,1), we define $x_w$ to be the unique solution of the equation $nxw^2(x) = 1$ if $w(1) \ge n^{-1/2}$ and $x_w = 1$ otherwise. Then

(12) $\mathbb{E}\left[H^2(s,\hat{s})\right] \le C(nx_w)^{-1}.$

If, in particular, $\sqrt{s}$ belongs to the Hölder class $\mathcal{H}^\alpha_R$ with $R \ge n^{-1/2}$, then $\mathbb{E}_s\left[H^2(s,\hat{s})\right] \le CR^{2/(2\alpha+1)}n^{-2\alpha/(2\alpha+1)}$.

iii) If $s \in S^3(D,R)$ with $2 \le D \le n$ and $R \ge 2$, we get

(13) $\mathbb{E}_s\left[H^2(s,\hat{s})\right] \le CDn^{-1}\log(nR/D).$
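
The passage from (12) to the Hölder rate can be illustrated numerically. The sketch below assumes a modulus of continuity of the form $w(x) = Rx^\alpha$ (our reading of the Hölder case), solves $nxw^2(x) = 1$ by bisection, and compares $(nx_w)^{-1}$ with $R^{2/(2\alpha+1)}n^{-2\alpha/(2\alpha+1)}$. The function names are hypothetical and the universal constant C is ignored.

```python
def solve_x_w(n, w, tol=1e-12):
    """Solution in (0, 1] of n * x * w(x)^2 = 1, found by bisection.
    Assumes x -> x * w(x)^2 is continuous and increasing; returns 1 when
    w(1) < n^(-1/2), as in the statement of the proposition."""
    if w(1.0) < n ** -0.5:
        return 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if n * mid * w(mid) ** 2 < 1:
            lo = mid          # the solution lies to the right of mid
        else:
            hi = mid          # the solution lies to the left of mid
    return hi

def holder_rate(n, R, alpha):
    """Closed form R^(2/(2a+1)) * n^(-2a/(2a+1)) quoted after (12)."""
    return R ** (2 / (2 * alpha + 1)) * n ** (-2 * alpha / (2 * alpha + 1))

if __name__ == "__main__":
    n, R, alpha = 1000, 2.0, 0.5
    x_w = solve_x_w(n, lambda x: R * x ** alpha)
    print(1 / (n * x_w), holder_rate(n, R, alpha))
```

For this choice of w the equation has the explicit solution $x_w = (nR^2)^{-1/(2\alpha+1)}$, so the two printed values agree up to the bisection tolerance.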

It is interesting to see to what extent the previous bounds (together with the trivial one, $\mathbb{E}[H^2(s,\hat{s})] \le 2$, which always holds but which we did not include in (11), (12) and (13) for simplicity) are optimal (up to the universal constants C). Many lower bounds on the minimax risk over various density classes are known for classical loss functions. For squared Hellinger loss, some are given in Birgé (1983 and 1986) and Birgé and Massart (1998). Many more are known for the squared $L_2$-loss, which can easily be extended to squared Hellinger loss because their proofs are based on perturbation arguments involving sets of densities for which both distances are equivalent. It follows from these classical results that the bounds we find for continuous densities are actually optimal (see Birgé, 1983, p. 211) while (11) is suboptimal because of the presence of the log factor. We shall see below that the more sophisticated penalization strategy introduced in Section 3.2.2 does solve the problem. The case of piecewise constant functions is more complicated. If D and the locations of the jumps were known, one could use a single model corresponding to the relevant partition with D intervals and get a risk bound CD/n corresponding to a parametric problem with D parameters. Apart from the constant C, this bound cannot be improved, which shows that the study of uniform risk bounds over $S^3(D,R)$ is only of interest when $D < n$ since otherwise a lower bound for the risk is of the order of the trivial upper bound 2. When D is smaller than n, the extra $\log(nR/D)$ factor in (13) is due to the fact that we have to estimate the locations of the jumps. The problem has been considered in Birgé and Massart (1998, Section 4.2 and Proposition 2) where it is shown that a lower bound for the risk (when $n \ge 5D$ and $D \ge 9$) is $cDn^{-1}\log(nD^{-1})$. Therefore our bound is optimal for moderate values of R. We do not know whether the $\log R$ factor in the upper bound is necessary or not.

4.1.2. Improved risk bounds with a better weighting strategy. If we use the weights $\Delta_m$ defined in Section 3.2.2 to build $\hat{s}$, we can only improve (up to constants) the risk bounds given in Proposition 1 since the value of $\Sigma$ does not change much while the new weights are not larger than the previous ones. Besides, the values of the weights have been substantially decreased for the partitions belonging to $\mathcal{M}_T^1$. It turns out that piecewise constant functions on the elements of $\mathcal{M}_T^1$ possess quite powerful approximation properties with respect to functions in Besov spaces $B^\alpha_{p,\infty}([0,1))$ with $\alpha < 1$ and monotone functions. These properties are given in the following theorem which also includes the multidimensional case.

Theorem 2. Let $X = [0,1)^k$, $\mathcal{M}_T^k$ be the set of partitions m of X defined in Section 3.2.4 and, for $m \in \mathcal{M}_T^k$, let $S'_m$ be the cone $\{t = \sum_{I\in m} t_I 1_I,\ t_I \ge 0\}$. For any $p > 0$, $\alpha$ with $1 > \alpha > k(1/p - 1/2)_+$ and any function t belonging to the Besov space $B^\alpha_{p,\infty}([0,1)^k)$ with Besov semi-norm $|t|_{B^\alpha_{p,\infty}}$, one can find some $t' \in \bigcup_{m\in\mathcal{M}_T^k} S'_m$ such that

(14) $\|t - t'\|_2 \le C(\alpha,k,p)\,|t|_{B^\alpha_{p,\infty}}\,|m|^{-\alpha/k},$

where $\|\cdot\|_2$ denotes the $L_2(dx)$-norm on $[0,1)^k$.

If t is a function of bounded variation on [0,1), there exists $t' \in \bigcup_{m\in\mathcal{M}_T^1} S'_m$ such that $\|t - t'\|_2 \le C\,\mathrm{Var}^*(t)\,|m|^{-1}$.

The bound (14) is given in DeVore and Yu (1990). The proof for the bounded variation case has been kindly communicated to the second author by Ron DeVore. With the help of this theorem, we can now derive from Theorem 1 the following improved bounds, the proof of which is straightforward.

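Proposition 2. Let $\hat{s}$ be the PHE based on the weights $\Delta_m$ defined in Section 3.2.2.

If $\sqrt{s}$ is a function of bounded variation with $\mathrm{Var}^*(\sqrt{s}) \le R$ and in particular if it belongs to $S^1(R)$, then

(15) $\mathbb{E}\left[H^2(s,\hat{s})\right] \le \min\left\{C(R/n)^{2/3},\ 2\right\}$ for $R \ge n^{-1/2}$.

If $\sqrt{s} \in B^\alpha_{p,\infty}([0,1))$ with $1 > \alpha > (1/p - 1/2)_+$ and $|\sqrt{s}|_{B^\alpha_{p,\infty}} \le R$ with $R \ge n^{-1/2}$, then

$\mathbb{E}\left[H^2(s,\hat{s})\right] \le \min\left\{C(\alpha,p)R^{2/(1+2\alpha)}n^{-2\alpha/(1+2\alpha)},\ 2\right\}.$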

It follows from classical lower bounds arguments that these bounds are minimax 
up to constants. 

4.1.3. The multidimensional case. When the density s defined on $X = B_k$ can be written $s(x) = \Phi(\|x\|)$ for some function $\Phi$ on [0,1), we use the family of models introduced in Example 2. We then obtain the risk bounds given in Propositions 1 and 2 if we replace the assumptions on s by the same on $\Phi$. We omit the details.

If $X = [0,1)^k$, $k \ge 2$ and we use the family of models and weights described in Section 3.2.4, we get the following result.

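Proposition 3. Let $R \ge k^{-1}n^{-1/2}$. If $\sqrt{s}$ belongs to $\mathcal{H}^\alpha_R([0,1)^k)$, then

(16) $\mathbb{E}\left[H^2(s,\hat{s})\right] \le \min\left\{C(Rk)^{2k/(k+2\alpha)}n^{-2\alpha/(2\alpha+k)},\ 2\right\}.$

More generally, if $\sqrt{s}$ belongs to $B^\alpha_{p,\infty}([0,1)^k)$ with $1 > \alpha > k(1/p - 1/2)_+$ and $|\sqrt{s}|_{B^\alpha_{p,\infty}} \le R$, then

$\mathbb{E}\left[H^2(s,\hat{s})\right] \le \min\left\{C(\alpha,k,p)R^{2k/(k+2\alpha)}n^{-2\alpha/(k+2\alpha)},\ 2\right\}.$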

Proof: Let $m = \mathcal{K}_j$ be an element of $\mathcal{M}_R$. Then $\Delta_m = |m| = 2^{kj}$ and the maximal variation of a function of $\mathcal{H}^\alpha_R([0,1)^k)$ on an element of m is bounded by $Rk2^{-j\alpha}$ so that $H^2(s,S_m) \le (Rk)^2 2^{-2j\alpha}$. It then follows from Theorem 1 that $\mathbb{E}[H^2(s,\hat{s})] \le C\left[(Rk)^2 2^{-2j\alpha} + n^{-1}2^{kj}\right]$. The lower bound on R allows us to choose $j \in \mathbb{N}$ such that $2^j \le (n(Rk)^2)^{1/(k+2\alpha)} < 2^{j+1}$ which leads to

$\mathbb{E}\left[H^2(s,\hat{s})\right] \le C\left[(Rk)^2 2^{2\alpha}\left(n(Rk)^2\right)^{-2\alpha/(k+2\alpha)} + n^{-1}\left(n(Rk)^2\right)^{k/(k+2\alpha)}\right].$

The first bound follows since $2^{2\alpha} < 4$. The second bound can be proved in the same way from (14). □
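
The balance between the bias term $(Rk)^2 2^{-2j\alpha}$ and the term $n^{-1}2^{kj}$ used in this proof can be reproduced numerically. The following Python sketch (hypothetical function names, constants C ignored) picks j as in the proof and checks that the sum of the two terms is at most $(2^{2\alpha}+1)$ times the rate in (16).

```python
from math import floor, log2

def best_dyadic_level(n, R, k, alpha):
    """Integer j with 2^j <= (n (R k)^2)^(1/(k + 2 alpha)) < 2^(j+1).
    Requires R >= 1 / (k sqrt(n)), the lower bound on R in the statement."""
    x = (n * (R * k) ** 2) ** (1.0 / (k + 2 * alpha))
    if x < 1:
        raise ValueError("R is below the lower bound k^-1 n^-1/2")
    return floor(log2(x))

def bias_plus_variance(n, R, k, alpha, j):
    """(R k)^2 2^(-2 j alpha) + 2^(k j) / n, the two terms balanced in the proof."""
    return (R * k) ** 2 * 2 ** (-2 * j * alpha) + 2 ** (k * j) / n

def rate(n, R, k, alpha):
    """(R k)^(2k/(k+2 alpha)) n^(-2 alpha/(k+2 alpha)), the rate in (16)."""
    return (R * k) ** (2 * k / (k + 2 * alpha)) * n ** (-2 * alpha / (k + 2 * alpha))

if __name__ == "__main__":
    n, R, k, alpha = 10_000, 1.0, 2, 0.5
    j = best_dyadic_level(n, R, k, alpha)
    print(j, bias_plus_variance(n, R, k, alpha, j), rate(n, R, k, alpha))
```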

4.2. Poisson processes. Let us consider the stochastic framework corresponding to Problem 2 where $\nu$ is dominated by some given measure $M = \lambda$ on X with density $s = d\nu/d\lambda$. This implies that $\mathbb{E}[N(A)] = \int_A s\,d\lambda$ for all measurable subsets A of X, as required. In this case, the performance of the PHE $\hat{s}$ is as follows.

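Theorem 3. Assume that the family $\mathcal{M}$ satisfies Assumption H and the weights $\{\Delta_m, m\in\mathcal{M}\}$ are chosen so that (2) holds. Then the estimator $\hat{s}$ defined in Section 2.3 with $\mathrm{pen}(m) \ge 3\delta|m| + 6\Delta_m$ satisfies

(17) $\mathbb{E}\left[H^2(s,\hat{s})\right] \le 390\left[\inf_{m\in\mathcal{M}}\left\{H^2(s,S_m) + \mathrm{pen}(m)\right\} + 3\Sigma^2\right] + \varepsilon.$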



This theorem should be compared with the results of Reynaud-Bouret (2003) who uses more general families of projection estimators than just histograms based on partitions. Nevertheless, for the problem we consider here, her choice of the $L_2$-loss induces some restrictions on both the intensity and the collection of partitions at hand. For instance, the intensity has to be bounded and the procedure requires some suitable estimation of its sup-norm. Like Castellan (1999), she cannot deal with partitions with arbitrarily small length.

Let us now apply this theorem to our families of models, systematically setting $\mathrm{pen}(m) = 3\delta|m| + 6\Delta_m$ and $\varepsilon = 1$. To facilitate the interpretation of the results to follow, it is convenient to use an analogy with density estimation. This analogy, based on the following heuristics, allows one to extrapolate the bounds from one framework to the other.

We recall that observing the Poisson process N of intensity s is equivalent to observing $N(X)$ i.i.d. random variables with density $s'$, where $N(X)$ is a Poisson variable with parameter $n = \int_X s\,d\lambda$ and $s' = n^{-1}s$. With this in mind, and even though n need not be an integer, we can view the estimation of s as an analogue of the estimation of the density $s'$ from n i.i.d. observations. Pursuing in this direction, we may rewrite the risk in the Poisson case as $\mathbb{E}[H^2(s,\hat{s})] = n\,\mathbb{E}[H^2(n^{-1}\hat{s},s')]$ and, setting $\hat{s}_n = n^{-1}\hat{s}$, view $\mathbb{E}[H^2(\hat{s}_n,s')] = n^{-1}\mathbb{E}[H^2(s,\hat{s})]$ as an analogue of the risk for estimating $s'$ from n i.i.d. observations. When $\sqrt{s}$ belongs to $S^1(R)$, $S^2(w)$ or $S^3(D,R)$, then the square root of the density $s' = s/n$ belongs to $S^1(Rn^{-1/2})$, $S^2(wn^{-1/2})$ or $S^3(D,Rn^{-1/2})$ respectively (provided that $R^2 \ge n$ in the last case, since otherwise $S^3(D,Rn^{-1/2})$ would not contain any density). From these two remarks, we may conclude that a risk bound of the form $f(R)$ in the Poisson case should be interpreted in the density case as $n^{-1}f(Rn^{-1/2})$.
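
The equivalence recalled above also gives a simple way to simulate a Poisson process: draw the total number of points from a Poisson law with mean $n = \int_X s\,d\lambda$, then draw that many i.i.d. points with density $s/n$. The Python sketch below does this for a piecewise constant intensity on [0,1). The function names are hypothetical and the restriction to piecewise constant intensities is made only to keep the sampling elementary.

```python
import random
from math import exp

def sample_poisson(mean, rng):
    """Poisson variate by Knuth's multiplication method (fine for moderate means)."""
    threshold = exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_poisson_process(heights, breakpoints, rng):
    """Points of a Poisson process on [0,1) whose intensity equals heights[i]
    on [breakpoints[i], breakpoints[i+1])."""
    cells = list(zip(breakpoints[:-1], breakpoints[1:]))
    masses = [h * (b - a) for h, (a, b) in zip(heights, cells)]
    n = sum(masses)                      # n = integral of the intensity
    size = sample_poisson(n, rng)        # total number of points, Poisson(n)
    points = []
    for _ in range(size):
        # choose a cell with probability mass / n, then a uniform point in it,
        # which amounts to sampling from the density s / n
        a, b = rng.choices(cells, weights=masses)[0]
        points.append(a + (b - a) * rng.random())
    return points

if __name__ == "__main__":
    rng = random.Random(0)
    pts = sample_poisson_process([10.0, 50.0], [0.0, 0.5, 1.0], rng)
    print(len(pts), sum(1 for p in pts if p >= 0.5))
```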

Example 1, continued. Here we deal with a Poisson process N on a finite interval of R, which we may assume, without loss of generality, to be [0,1), of intensity s with respect to the Lebesgue measure $\lambda$. To estimate s we use the family of models of Example 1 with the weights $\Delta_m$ defined in Section 3.2.2. The resulting PHE $\hat{s}$ has the following properties which can be proved exactly like those given in Propositions 1 and 2.

Proposition 4. Let w be a modulus of continuity on [0,1). We define $x_w$ to be the unique solution of the equation $xw^2(x) = 1$ if $w(1) \ge 1$ and $x_w = 1$ otherwise. Then

(18) $\mathbb{E}\left[H^2(s,\hat{s})\right] \le Cx_w^{-1}$ for all s such that $\sqrt{s} \in S^2(w)$.

If, in particular, $\sqrt{s}$ belongs to the Hölder class $\mathcal{H}^\alpha_R$ with $R \ge 1$, then

$\mathbb{E}\left[H^2(s,\hat{s})\right] \le CR^{2/(2\alpha+1)}.$

Given $D \ge 2$ and $R \ge 2D$, we get

(19) $\mathbb{E}\left[H^2(s,\hat{s})\right] \le CD\log(R/D)$ for all $s \in S^3(D,R)$.

If $\sqrt{s}$ belongs to $S^1(R)$ with $R \ge 1$, then $\mathbb{E}[H^2(s,\hat{s})] \le CR^{2/3}$.

If $\sqrt{s} \in B^\alpha_{p,\infty}([0,1))$ with $1 > \alpha > (1/p - 1/2)_+$ and $|\sqrt{s}|_{B^\alpha_{p,\infty}} \le R$ with $R \ge 1$, then $\mathbb{E}[H^2(s,\hat{s})] \le CR^{2/(1+2\alpha)}$.

For the sake of simplicity, let us assume that $n = \int_X s\,d\lambda$ is an integer. The connection established above between the estimation of a density and that of the intensity of a Poisson process shows that Proposition 4 is actually a perfect analogue of Propositions 1 and 2. Namely, when $\sqrt{s}$ belongs to $S^1(R)$ or $S^2(w)$ or $s \in S^3(D,R)$ and $s' = s/n$, then $\sqrt{s'}$ respectively belongs to $S^1(Rn^{-1/2})$ or $S^2(wn^{-1/2})$ or $s' \in S^3(D,Rn^{-1})$ and the risk bounds we get for estimating the intensity s (with respect to the $H^2/n$-loss) are the same as those obtained from an n-sample for estimating the density $s'$ (with the $H^2$-loss).

Example 2, continued. If we observe a Poisson process on $X = B_k$ with intensity $s(x) = \Phi(\|x\|)$ with respect to the Lebesgue measure, for $\Phi$ some function on [0,1), and consider the family of models introduced in Example 1, we obtain the risk bounds given in Proposition 4 if we replace the assumptions on s by the same on $\Phi$.

Example 3, continued. If $X = [0,1)^k$ with $k \ge 2$, we use the models and weights defined in Section 3.2.4. Proceeding as for Proposition 2 we get:

Proposition 5. Let $\sqrt{s}$ belong to $\mathcal{H}^\alpha_R([0,1)^k)$, then

$\mathbb{E}_s\left[H^2(s,\hat{s})\right] \le C(Rk \vee 1)^{2k/(k+2\alpha)}.$

If $\sqrt{s}$ belongs to $B^\alpha_{p,\infty}([0,1)^k)$ with $1 > \alpha > k(1/p - 1/2)_+$ and $|\sqrt{s}|_{B^\alpha_{p,\infty}} \le R$, then

$\mathbb{E}\left[H^2(s,\hat{s})\right] \le C(\alpha,k,p)(R \vee 1)^{2k/(k+2\alpha)}.$

As shown by the proof of Proposition 3, we only use the partitions in $\mathcal{M}_R$ to get (16) so that it would be of little use to introduce other partitions if we only wanted to estimate intensities such that $\sqrt{s}$ belongs to $\mathcal{H}^\alpha_R([0,1)^k)$. The interest of considering the larger family $\mathcal{M}$ and of having a special definition of $\Delta_m$ when $m \in \mathcal{M}_T^k$ is that it allows us to improve the results when we deal with less regular functions than those for which $\sqrt{s}$ belongs to $\mathcal{H}^\alpha_R([0,1)^k)$, in particular those functions that belong to Besov spaces $B^\alpha_{p,\infty}([0,1)^k)$ with $1 > \alpha > k/p$. To illustrate this fact, let us study the estimation of those intensities s such that $\sqrt{s}$ has the following specific structure. Given the nonempty set $\mathcal{V}$ which is a finite union of elements of $\mathcal{K}$, there is a smallest integer J such that $\mathcal{V}$ can be written as the union of N elements of $\mathcal{K}_J$ with a volume $V = N2^{-kJ} > 0$. To avoid trivialities, we assume that $J > 0$, hence $V < 1$.

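Proposition 6. Let s be an intensity on $[0,1)^k$ such that $\sqrt{s}\,1_{\mathcal{V}}$ belongs to $\mathcal{H}^\alpha_R(\mathcal{V})$ with $R \ge 1$ while $\sqrt{s}\,1_{\mathcal{V}^c}$ is constant and let $\hat{s}$ be the PHE based on the weights $\Delta_m$ defined in Section 3.2.4. Then

(20) $\mathbb{E}\left[H^2(s,\hat{s})\right] \le C\inf_{m\in\mathcal{M}} B_m$ with $B_m = H^2(s,S_m) + |m| + \Delta_m$

and

$\inf_{m\in\mathcal{M}} B_m \le C\min\left\{\begin{array}{ll} 2^{kJ} + V^{k/(k+2\alpha)}(kR)^{2k/(k+2\alpha)}; & (21)\\[2pt] V\left[kJ2^{kJ} + (kR)^{2k/(2\alpha+k)}[\log(Rk)]^{2\alpha/(2\alpha+k)}\right]; & (22)\\[2pt] V\left[2^kJ2^{kJ} + (kR)^{2k/(2\alpha+k)}\right]. & (23)\end{array}\right.$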



Proof: Since (20) is merely a consequence of Theorem 3 with the choice $\mathrm{pen}(m) = 3\delta|m| + 6\Delta_m$ and $\varepsilon = 1$, we only have to bound $B_m$. Let us first consider a regular partition $m = \mathcal{K}_j$. If $j < J$, the bias $H^2(s,S_m)$ may be arbitrarily large since the intensity s may be arbitrarily large on $\mathcal{V}$ while it may be small on $\mathcal{V}^c$. For $j \ge J$, the argument used for the proof of Proposition 2 shows that on $\mathcal{V}$, $\sqrt{s}$ can be approximated uniformly by an element of $S_m$ with a precision at least $Rk2^{-j\alpha}$ so that $H^2(s,S_m) \le VR^2k^2 2^{-2j\alpha}$ and $B_m \le VR^2k^2 2^{-2j\alpha} + 2^{kj+1}$. If $[VR^2k^2]^{1/(2\alpha+k)} < 2^J$ we set $j = J$ and otherwise choose j so that $2^j \le [VR^2k^2]^{1/(2\alpha+k)} < 2^{j+1}$. This leads to (21).

If we set $m = m_p \vee \mathcal{K}_0$ with p being the set of those $N2^{k(j-J)} = V2^{kj} \ge 1$ elements of $\mathcal{K}_j$ ($j \ge J \ge 1$) that exactly cover $\mathcal{V}$, we get, since $k \ge 2$,

$B_m \le VR^2k^2 2^{-2j\alpha} + (kj+1)V2^{kj} + 1 \le Vk\left[R^2k2^{-2j\alpha} + 2j2^{kj}\right].$

If $[k^2R^2/\log(kR)]^{1/(2\alpha+k)} < 2^J$ we set $j = J$ and otherwise choose j so that $2^j \le [k^2R^2/\log(kR)]^{1/(2\alpha+k)} < 2^{j+1}$, which finally leads to (22).

To study the approximation properties of the elements of $\mathcal{M}_T^k$ let us consider a particular cube $K' \in \mathcal{V}\cap\mathcal{K}_J$. Identifying the partitions in $\mathcal{M}_T^k$ with the trees from which they derive, we can design an element $m_{K'}$ of $\mathcal{M}_T^k$ with $2^k - 1$ terminal nodes at each level 1 to J and the remaining node $K'$ at level J. Then we keep only non-terminal nodes up to level $j \ge J$, all nodes at this last level j being terminal, so that their number is $2^{k(j-J)}$. The total number of terminal nodes of the tree is therefore $J(2^k-1) + 2^{k(j-J)}$. We can repeat this operation for each of the N cubes in $\mathcal{V}\cap\mathcal{K}_J$ keeping the value of j fixed. This results in N similar trees. We finally consider the smallest complete tree m that contains the N previous ones. Its number of terminal nodes is then bounded by $N\left[J(2^k-1) + 2^{k(j-J)}\right]$ so that

$B_m \le V(Rk)^2 2^{-2j\alpha} + 2N\left[J(2^k-1) + 2^{k(j-J)}\right] \le 2V\left[R^2k^2 2^{-2j\alpha} + J2^{k(J+1)} + 2^{kj}\right].$

If $(k^2R^2)^{1/(2\alpha+k)} < 2^J$ we set $j = J$ and otherwise choose j so that $2^j \le (k^2R^2)^{1/(2\alpha+k)} < 2^{j+1}$, which leads to (23). □

A comparison of the three bounds (21), (22) and (23) shows that (23) is always better if we omit the influence of J and k but the situation becomes more involved if we take into account the effect of k and J. Depending on the values of V, R, J, $\alpha$ and k, each type of partition may be the best, which justifies introducing them all.

Remark: An analogue of Proposition 6 holds for density estimation.



4.3. Non-negative random vectors. Let us recall from the introduction that we observe an n-dimensional random vector with independent nonnegative components $N_1,\ldots,N_n$ and respective distributions depending on positive parameters $s_1,\ldots,s_n$. One should think of the $N_i$ as Poisson or binomial random variables with unknown expectations $s_i$. More generally, we assume that there exist some known constants $\kappa > 0$ and $\tau \ge 0$ such that for all $i \in X = \{1,\ldots,n\}$

(24) $\log\mathbb{E}\left[e^{z(N_i-s_i)}\right] \le \kappa\,\dfrac{z^2s_i}{2(1-z\tau)}$ for all $z \in [0,1/\tau[$,

with the convention $1/\tau = +\infty$ if $\tau = 0$, and

(25) $\log\mathbb{E}\left[e^{-z(N_i-s_i)}\right] \le \kappa\,\dfrac{z^2s_i}{2}$ for all $z \ge 0$.

In the case of Poisson or binomial random variables, one can take $\kappa = \tau = 1$ as we shall see below.



Our aim is to estimate the function s from X to $\mathbb{R}_+$ given by $s(i) = s_i$. Here we denote by $\lambda$ the counting measure on X and set $Y = 1$. Hence $M = \lambda$ and $N(A) = \sum_{i\in A} N_i$. Then $\mathcal{L}$ can be identified with $\mathbb{R}_+^n$, $\mathbb{E}[N(A)] = \int_A s\,d\lambda$ as required and $H^2(t,t') = \sum_{i=1}^n\left(\sqrt{t(i)} - \sqrt{t'(i)}\right)^2$ for $t, t' \in \mathcal{L}$.

Theorem 4. Assume that (24) and (25) hold, that the family $\mathcal{M}$ satisfies Assumption H and the weights $\{\Delta_m, m\in\mathcal{M}\}$ are chosen so that (2) holds. Let $\mathrm{pen}(m) \ge \kappa\left[\delta\left(1+K^2\right)|m| + 3K^2\Delta_m\right]$ with



$K = \sqrt{2}$ if $\tau \le \kappa$; K
and let $\hat{s}$ be the PHE defined in Section 2.3. Then



V2 , /7 r 

h it t > k; 

2 \ n 2 



$\mathbb{E}\left[H^2(s,\hat{s})\right] \le 390\left[\inf_{m\in\mathcal{M}}\left\{H^2(s,S_m) + \mathrm{pen}(m)\right\} + (3/2)\kappa K^2\Sigma^2\right] + \varepsilon.$



Let us first check that some classical distributions do satisfy Inequalities (24) and (25). If $N_i$ is a binomial random variable with parameters $n_i, p_i$ then for all $z \in \mathbb{R}$,

(26) $\log\mathbb{E}\left[e^{z(N_i-s_i)}\right] \le s_i\left(e^z - z - 1\right)$ with $s_i = n_ip_i$.

If $N_i$ is a Poisson random variable with parameter $s_i$, then equality holds in (26). Using the bounds $e^z - z - 1 \le z^2/[2(1-z)]$ for $z \in [0,1[$ and $e^z - z - 1 \le z^2/2$ for $z \le 0$, we derive that, in both cases, (24) and (25) hold with $\kappa = \tau = 1$. If $N_i$ has a Gamma distribution $\Gamma(s_i,1)$, $\mathbb{E}[N_i] = s_i$ and, following the proof of Lemma 1 of Laurent and Massart (2000), we deduce that (24) and (25) hold again with $\kappa = \tau = 1$. More generally, it follows from some version of Bernstein's Inequality (see Lemma 8 of Birgé and Massart (1998)) that (24) holds as soon as

$\mathbb{E}\left[(N_i)^p\right] \le \kappa\,\dfrac{p!}{2}\,s_i\tau^{p-2}$, for all $i \in X$ and $p \ge 2$.

Inequality (25) is always satisfied if $N_i \le \kappa$. Indeed it follows from

$e^{-zx} \le 1 - zx + z^2x^2/2$ for all $x, z \ge 0$

that all non-negative random variables X bounded by $\kappa$ satisfy

$\mathbb{E}\left[e^{-zX}\right] \le 1 - z\mathbb{E}[X] + \dfrac{z^2\mathbb{E}[X^2]}{2} \le \exp\left(-z\mathbb{E}[X] + \kappa z^2\mathbb{E}[X]/2\right).$
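
The three elementary inequalities used in this verification are easy to check on a grid. The Python sketch below is only such a numerical check, with hypothetical function names and a small slack for rounding errors; it is not a proof.

```python
from math import exp, expm1

SLACK = 1e-15  # tolerance for floating-point rounding

def upper_deviation_bound_holds(z):
    """e^z - z - 1 <= z^2 / (2 (1 - z)) for 0 <= z < 1, used for (24)."""
    return expm1(z) - z <= z * z / (2 * (1 - z)) + SLACK

def lower_deviation_bound_holds(z):
    """e^z - z - 1 <= z^2 / 2 for z <= 0, used for (25)."""
    return expm1(z) - z <= z * z / 2 + SLACK

def bounded_variable_bound_holds(z, x):
    """e^{-z x} <= 1 - z x + z^2 x^2 / 2 for z, x >= 0."""
    u = z * x
    return exp(-u) <= 1 - u + u * u / 2 + SLACK

if __name__ == "__main__":
    print(all(upper_deviation_bound_holds(i / 1000) for i in range(1000)))
    print(all(lower_deviation_bound_holds(-i / 100) for i in range(2000)))
    print(all(bounded_variable_bound_holds(i / 10, j / 10)
              for i in range(100) for j in range(100)))
```

With $\kappa = \tau = 1$, multiplying the first two inequalities by $s_i$ gives (24) and (25) for a Poisson variable, since equality holds in (26) in that case.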

The results of Kolaczyk and Nowak (2004), which are based on some sort of discretized penalized maximum likelihood estimator in the spirit of Barron and Cover (1991), have some similarity with ours but they assume that the components of the vector s belong to some known interval $[c,C]$, $c > 0$, and they explicitly use the values of c and C in the construction of their estimator. Such an assumption, which implies, as in the case of density estimation, that squared Hellinger distance and Kullback divergence are equivalent, also greatly simplifies the estimation problem.

Example 4, continued. Setting

(27) $\mathrm{pen}(m) = \kappa\left[\left(1+K^2\right)|m| + 3K^2\Delta_m\right]$ and $\varepsilon = 1$,

and using $\log\binom{n-1}{D-1} \le (D-1)\left(1 + \log[(n-1)/(D-1)]\right)$ with the convention $\log((n-1)/0) = 0$, we get the risk bound

(28) $\mathbb{E}\left[H^2(s,\hat{s})\right] \le C(\kappa,K)\inf_{m\in\mathcal{M}}\left\{H^2(s,S_m) + |m| + (|m|-1)\log\left(\dfrac{n-1}{|m|-1}\right)\right\}.$

If, for instance, s itself belongs to some $S_m$ with a small value of $|m|$, which corresponds to a piecewise stationary process $(N_i)_{1\le i\le n}$ with a few distribution changes, the risk is bounded by $C(\kappa,K)|m|\log n$.
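
The bound on the logarithm of the binomial coefficient used to obtain (28) can be checked exhaustively for small n. The following Python sketch does so; the function name is hypothetical, and for D = 1 the right-hand side is read as 0, in line with the stated convention.

```python
from math import comb, log

def log_binomial_bound_holds(n, D):
    """Checks log C(n-1, D-1) <= (D-1) (1 + log((n-1)/(D-1))) for 1 <= D <= n."""
    lhs = log(comb(n - 1, D - 1))
    if D == 1:
        rhs = 0.0                      # convention for the degenerate case D = 1
    else:
        rhs = (D - 1) * (1 + log((n - 1) / (D - 1)))
    return lhs <= rhs + 1e-12          # small slack for rounding errors

if __name__ == "__main__":
    print(all(log_binomial_bound_holds(n, D)
              for n in range(1, 61) for D in range(1, n + 1)))
```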

Another interesting situation corresponds to the case of a monotone sequence $(s_i)_{1\le i\le n}$, i.e. a monotone function s on X that we may assume, without loss of generality, to be nondecreasing.

Proposition 7. Let the sequence $s_i$, $1 \le i \le n$, be nondecreasing with $\sqrt{s_n} - \sqrt{s_1} = R$, then the PHE $\hat{s}$ based on the models of Example 4 with pen and $\varepsilon$ given by (27) satisfies the following risk bounds with a constant C depending only on $\kappa$ and K:

• if $R^2 \le n^{-1}\log n$, then $\mathbb{E}\left[H^2(s,\hat{s})\right] \le C(\kappa,K)\left(nR^2+1\right)$;

• if $R \ge n/\sqrt{3}$, then $\mathbb{E}\left[H^2(s,\hat{s})\right] \le C(\kappa,K)\,n$;

• otherwise $\mathbb{E}\left[H^2(s,\hat{s})\right] \le C(\kappa,K)\left[R\sqrt{n}\log(n/R)\right]^{2/3}$.



Remark: If we restrict ourselves to the case $n = 2^k$, we can turn any function s on X into a function $s'$ on [0,1) by setting $s' = \sum_{i=1}^n s(i)1_{[(i-1)2^{-k},\,i2^{-k})}$. This transformation will, in particular, preserve the monotonicity properties of the functions. One could then estimate $s'$ using the more sophisticated families of weights that we introduced in Section 3.2.2. The use of this strategy would improve the estimation of monotone functions, removing the logarithmic factors.

Example 5, continued. Choosing pen and $\varepsilon$ as in (27) and using the same arguments as for Example 1, we derive an analogue of (28) with n replacing $n-1$ in the logarithmic factor. If we assume that $s_i = \bar{s}$ for $i \notin I$ with $|I| = k$, then $H^2(s,S_m) = 0$ for some $m \in \mathcal{M}_k$ and

$\mathbb{E}\left[H^2(s,\hat{s})\right] \le C(\kappa,K)[k + 1 + k\log(n/k)].$

5. Special counting processes on the line

Let X be some interval of $\mathbb{R}_+$ of the form $[0,\zeta)$ where $0 < \zeta \le +\infty$ with its Borel $\sigma$-algebra $\mathcal{A}$. We recall that a (univariate) counting process N on X is a cadlag (right-continuous with left-hand limits) process from X to $\mathbb{R}_+$, vanishing at time $t = 0$, with piecewise constant and nondecreasing paths having jumps of size +1 only. The use of counting processes in statistical modeling is developed in great detail in the book by Andersen et al. (1993) where the interested reader will find many concrete situations for which these processes naturally arise. Typically, $N_t$ counts the number of occurrences of a certain event from time 0 up to time t. The jumping times of the process give the dates of occurrence of the event. A counting process can be associated to a random measure N on X whose cumulative distribution function is the counting process itself, i.e. $N([0,t]) = N_t$ for all $t \in X$. In the sequel, we shall not distinguish between the counting process N and its associated measure N.

In this paper, we consider a phenomenon which is described by some bounded counting process $N^*$ on X such that $N^*(X) \le k$ a.s. for some known integer k. This means that $N^*$ describes an event that occurs at most k times during the period X. We also assume that there exist a deterministic measure $\lambda$ on X, a deterministic nonnegative function $s \in L_1(X,d\lambda)$ and a nonnegative observable process $Y^*$ bounded by 1 on X such that

(29) $\mathbb{E}\left[N^*([0,t])\right] = \mathbb{E}\left[\int_0^t sY^*\,d\lambda\right]$ for all $t \in X$.



We actually observe an aggregated counting process N which is the sum of n i.i.d. processes $N^j$, $j = 1,\ldots,n$, with the same distribution as $N^*$. The fact that the measure $N^j$ is determined by its cumulative distribution function and (29) imply that there are i.i.d. observable processes $Y^j$, $j \in \{1,\ldots,n\}$, with the distribution of $Y^*$ such that

$\mathbb{E}\left[N^j(A)\right] = \mathbb{E}\left[\int_A sY^j\,d\lambda\right]$ for all $A \in \mathcal{A}$ and $1 \le j \le n$.

Therefore $\mathbb{E}[N(A)] = \mathbb{E}\left[\int_A s\,dM\right]$ for all $A \in \mathcal{A}$, with $M = Y\,d\lambda$ and $Y = \sum_{j=1}^n Y^j$. For such counting processes, we can prove the following result.

Theorem 5. Assume that there exist a positive integer k and a positive number $\kappa'$, both known, such that $N^*(X) \le k$ a.s., (29) holds and $\mathrm{Var}\left[\int_I sY^*\,d\lambda\right] \le \kappa'\,\mathbb{E}\left[\int_I sY^*\,d\lambda\right]$ for all intervals $I \subset X$. Assume moreover that $\int_X s\,d\lambda < +\infty$ and
the aggregated process N satisfies Let us choose a family A4 satisfying Assump- 
tion H and weights {A' m , m G Ai} such that 

(30) ex P[-vKn] = S'(r/) < +oo forr] = k(k+ f sdx\ . 
Then the estimator $\hat{s}_{\hat{m}}$ defined in Section 2.3 with $\mathrm{pen}(m) \ge 16\delta|m|(k+\kappa') + 404k\Delta'_m$
satisfies 

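$\mathbb{E}\left[H^2(\hat{s}_{\hat{m}},s)\right] \le 390\left(\mathbb{E}\left[\inf_{m\in\mathcal{M}}\left(H^2(s,S_m) + \mathrm{pen}(m)\right)\right] + 404k\eta^{-1}\left[\Sigma'(\eta)\right]^2\right) + \varepsilon$

$\le 390\left(\inf_{m\in\mathcal{M}}\left\{\mathbb{E}\left[H^2(s,S_m)\right] + \mathrm{pen}(m)\right\} + 404k\eta^{-1}\left[\Sigma'(\eta)\right]^2\right) + \varepsilon.$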



In the last bound, E [H 2 (s, S m )] plays the role of a bias term which can be 
bounded in the following way. Let us set 

S' m = \ t = ^2 i/lj with ti > for all I G m i p £, 
I iemnj J 

where the ij- are now deterministic. Then S 1 ^ C S'm, hence H 2 (s,S m ) < H 2 (s,S' m ) 
and, for f G fi^, 

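$H^2(s,t) = \int\left(\sqrt{s}-\sqrt{t}\right)^2 Y\,d\lambda \le n\int\left(\sqrt{s}-\sqrt{t}\right)^2 d\lambda,$

since $Y \le n$. Finally

$\mathbb{E}\left[H^2(s,S_m)\right] \le n\inf_{t\in S'_m}\int\left(\sqrt{s}-\sqrt{t}\right)^2 d\lambda = b_m^2(s)$

and

$\mathbb{E}\left[H^2(\hat{s}_{\hat{m}},s)\right] \le 390\left(\inf_{m\in\mathcal{M}}\left\{b_m^2(s) + \mathrm{pen}(m)\right\} + \dfrac{404k}{\eta}\left[\Sigma'(\eta)\right]^2\right) + \varepsilon.$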



Note that the present framework includes, as a particular case, density estimation, if we observe an n-sample $X_1,\ldots,X_n$ with density s with respect to $\lambda$ and set $N^j(A) = 1_A(X_j)$. Then $Y = n$ and $H^2(s,t) = n\int_X\left(\sqrt{s}-\sqrt{t}\right)^2 d\lambda$ which corresponds to using the distance H of Section 4.1 multiplied by $\sqrt{n}$. Up to this scaling factor, the previous risk bound is analogous to that for estimating densities we get in Theorem 1.

In order to derive risk bounds which are similar to those given in Proposition 1, we have to distinguish between two situations. The most favorable one occurs when we know an upper bound $\Gamma$ for $\int_X s\,d\lambda$, in which case, since $0 \le Y^* \le 1$,

$\mathrm{Var}\left[\int_I sY^*\,d\lambda\right] \le \mathbb{E}\left[\left(\int_I sY^*\,d\lambda\right)^2\right] \le \left[\int_X s\,d\lambda\right]\mathbb{E}\left[\int_I sY^*\,d\lambda\right]$

and we can set $\kappa' = \Gamma$. Moreover, assuming that (2) holds, we can choose $\Delta'_m = (1+k\Gamma)\Delta_m$ without any further restriction on the family of models. Using the same family of partitions as in the density case, we recover the bounds of Propositions 1 and 2 up to the factor n corresponding to the rescaling of the distance H.

Let us now turn to the less favorable situation where no bound for $\int_X s\,d\lambda$ is known, which is the typical case for Problem 4. As we shall see, the number $\kappa'$ can still be computed. As to (30), it will be satisfied with $\Delta'_m = |m|$ as soon as the number of models such that $|m| = D$ is bounded independently of D. Restricting ourselves to the family $\mathcal{M}_R$ of regular partitions, we recover, up to the factor n, the bounds provided by case ii) of Proposition 1.






YANNICK BARAUD AND LUCIEN BIRGE 



5.1. Survival analysis with right-censored data. Let us now consider the framework of Problem 3, denoting by $P_T$ the common distribution of the $T_j$. We consider the counting process N on $\mathbb{R}_+$ defined by $N = \sum_{j=1}^n N^j$ where $N^j(A) = 1_{\{T_j\in A,\ T_j\le C_j\}}$ for all measurable subsets A of $\mathbb{R}_+$, so that we can take $k = 1$. Then the variables $N^j(A)$, $1 \le j \le n$, are i.i.d. Bernoulli random variables. We define s to be the hazard rate of the survival times, i.e. $s(t) = p(t)/\mathbb{P}[T_1 \ge t]$ for $t \ge 0$. Since s is not integrable on $\mathbb{R}_+$ we shall restrict ourselves to some bounded interval X of $\mathbb{R}_+$, which we can take, without loss of generality, to be [0,1) if we assume that $\mathbb{P}[T_1 \ge 1] > 0$. We also assume here that the censorship satisfies, for all $t \ge 0$,

(31) $\mathbb{E}\left[N^j([0,t])\right] = \mathbb{E}\left[\int_0^t s(u)Y^j(u)\,du\right]$ with $Y^j(t) = 1_{\{T_j\wedge C_j \ge t\}}$,

which means that (29) holds. Equality (31) is clearly satisfied when $C_j \ge T_j$ for all j, i.e. when the data are uncensored. It is also satisfied when the censorship is independent of the survival time, i.e. when $C_j$ and $T_j$ are independent for all j. Indeed, we then have for all j and $t \ge 0$, by Fubini's Theorem and independence,

$\mathbb{E}\left[\int_0^t s(u)Y^j(u)\,du\right] = \mathbb{E}\left[\int_0^t \dfrac{p(u)}{\mathbb{P}(T_j \ge u)}\,1_{\{T_j \ge u\}}1_{\{C_j \ge u\}}\,du\right] = \int_0^t \dfrac{p(u)\,\mathbb{P}(T_j \ge u)\,\mathbb{P}(C_j \ge u)}{\mathbb{P}(T_j \ge u)}\,du$

$= \int 1_{[0,t]}(u)\,\mathbb{P}(C_j \ge u)\,dP_T(u) = \mathbb{P}\left[T_j \le t,\ T_j \le C_j\right] = \mathbb{E}\left[N^j([0,t])\right].$
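
The identity just derived can be illustrated by simulation. In the Python sketch below, the survival time T and the censoring time C are independent exponential variables with rates a and c. This parametric choice is an assumption made purely for illustration (the hazard rate is then the constant a), and the function names are hypothetical. Both sides of the identity are estimated by Monte Carlo and compared with the common closed form.

```python
import random
from math import exp

def closed_form(a, c, t):
    """P[T <= t, T <= C] for independent T ~ Exp(a) and C ~ Exp(c)."""
    return a / (a + c) * (1 - exp(-(a + c) * t))

def monte_carlo_both_sides(a, c, t, reps, rng):
    """Monte Carlo estimates of E[N([0,t])] and of E[int_0^t s(u) Y(u) du],
    where Y(u) = 1{min(T, C) >= u} and the hazard rate s is constant, equal to a."""
    left = right = 0.0
    for _ in range(reps):
        T = rng.expovariate(a)
        C = rng.expovariate(c)
        left += 1.0 if (T <= t and T <= C) else 0.0
        right += a * min(T, C, t)   # integral of the hazard over [0, min(T, C, t)]
    return left / reps, right / reps

if __name__ == "__main__":
    rng = random.Random(1)
    print(monte_carlo_both_sides(1.0, 0.5, 1.0, 100_000, rng), closed_form(1.0, 0.5, 1.0))
```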

Proposition 8. If the processes $N^j$ satisfy (31), the assumptions of Theorem 5 hold with $k = 1$, $\kappa' = 2$ and $\int_X s\,d\lambda = -\log\left(\mathbb{P}[T_1 \ge 1]\right)$.

From a practical point of view, one can always estimate $\mathbb{P}[T_1 \ge 1]$ accurately enough to assume that an upper bound $\Gamma$ for $\int_X s\,d\lambda$ is known. We can therefore apply Theorem 5 to the family of models of Example 1 with the weights $\Delta_m$ given in Section 4.1, setting $\Delta'_m = (1+\Gamma)\Delta_m$. We then obtain perfect analogues of Propositions 1 and 2 with constants C now depending on $\Gamma$. To avoid redundancy, we leave the precise statement of the risk bounds to the reader.



5.2. Transition intensities of Markov processes. Within the framework of Problem 4, we associate to $T_{1,0}$ the counting process $N^*$ defined for $t \ge 0$ by

(32) $N^*([0,t]) = 1_{\{T_{1,0}\le t\}}$

so that

$\mathbb{E}\left[N^*([0,t])\right] = \int_0^t p(u)\,du = \mathbb{E}\left[\int_0^t 1_{\{X_{u-}=1\}}s(u)\,du\right]$

and (29) holds with $Y^*(u) = 1_{\{X_{u-}=1\}}$. Our aim here is to estimate s on some bounded interval X of $\mathbb{R}_+$ from the observation of the counting process $N = \sum_{j=1}^n N^j$ where the $N^j$'s are i.i.d. copies of $N^*$ associated to n i.i.d. copies $X^1,\ldots,X^n$ of the process X. If X takes only the two values 0 and 1 and a.s. starts from 1 to reach 0, then the problem reduces to estimating the density p of $T_{1,0}$; it becomes novel when we have at least three states. In any case, we get the following result.

Proposition 9. If the weights $\Delta'_m$ satisfy $\sum_{m\in\mathcal{M}}\exp[-\eta\Delta'_m] < +\infty$ for all $\eta > 0$ and $\int_X s(t)\,dt < +\infty$ then Theorem 5 applies with $k = 1$ and $\kappa' = 2$.

6. A unifying result

We want here to analyze our estimation procedure from the general point of view described in Section 2 and prove a risk bound for the estimator $\hat{s}$, from which we shall be able to derive the previous risk bounds corresponding to all the specific frameworks that we considered. For this we introduce the following approximation $\bar{s}_m$ for s in $S_m$:

(33) s m = ^2 tttt If with sj = / sdX. 

We need here a bound for $H^2(\hat{s}_m,\bar{s}_m)$ which holds uniformly for $m \in \mathcal{M}$. It takes the following form:

H': There exist three positive constants a, b and c, $c > 1$, such that, for any $m \in \mathcal{M}$,

(34) $\mathbb{P}\left[H^2(\hat{s}_m,\bar{s}_m) > c|m| + bz\right] \le a\exp[-z]$ for all $z \ge 0$.

We can now derive bounds for the risk of the estimator $\hat{s}$ defined in Section 2.3.

Theorem 6. Let Assumptions H and H' hold and the weights $\Delta_m$ satisfy (2). Let the penalty pen(m) be given by

(35) $\mathrm{pen}(m) \ge c\delta|m| + b\Delta_m$

and $\hat{m}$ be any element of $\mathcal{M}$ selected as described in Section 2.3. Then the estimator $\hat{s} = \hat{s}_{\hat{m}}$ satisfies

(36) $\mathbb{E}\left[H^2(s,\hat{s})\right] \le 390\left(\mathbb{E}\left[\inf_{m\in\mathcal{M}}\left(H^2(s,S_m) + \mathrm{pen}(m)\right)\right] + ab\Sigma^2/2\right) + \varepsilon.$

Note that such a result has been obtained without any assumption on the underlying space X and the true value s of the parameter, apart from the fact that it belongs to $\mathcal{L}$. Note also that in (36), the infimum over $m \in \mathcal{M}$ occurs inside the expectation, which makes a difference when M, and therefore $H(s,S_m)$, is random.

As we have previously seen, $\delta \le 2$ for all the models we consider. Moreover, we shall see in Sections 7.3.1, 7.4.1 and 7.5.1 that for Problems 0, 1 and 2, $a = 1$ and b and c take the form $b = b'C_P$ and $c = c'C_P$ where $b'$ and $c'$ are numerical constants and $C_P$ depends on the problem we consider (for instance $C_P = n^{-1}$ for density estimation). If we choose $\mathrm{pen}(m) = c_0C_P(|m| + \Delta_m)$ for some suitable numerical constant $c_0$ and $\varepsilon \le C_P$, it follows that (36) becomes

$\mathbb{E}\left[H^2(s,\hat{s})\right] \le 390\left(\mathbb{E}\left[\inf_{m\in\mathcal{M}}\left(H^2(s,S_m) + c_0C_P(|m| + \Delta_m)\right)\right] + b'C_P\Sigma^2/2\right) + C_P,$

which gives (7). If there is only one model m in the family $\mathcal{M}$, we can fix $\Delta_m = 0$, hence $\Sigma = 1$, which leads to (5).

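Proof. Let $m^*$ be an arbitrary element of $\mathcal{M}$. It follows from the definition of $\mathcal{D}$
that for any $m \in \mathcal{M}$, $H^2(\hat{s}_m, \hat{s}_{m^*}) \le \mathcal{D}(m) \vee \mathcal{D}(m^*)$. Therefore,

(37) $H^2(\hat{s}_{\hat{m}}, \hat{s}_{m^*}) \le \mathcal{D}(\hat{m}) \vee \mathcal{D}(m^*) \le \mathcal{D}(m^*) + \varepsilon/3,$

by the definition of $\hat{m}$. It also follows from the definition of $T_{m,m^*}$ that, if $T_{m,m^*} \le 0$, then

(38) $H^2(\hat{s}_m, \hat{s}_{m\vee m^*}) - H^2(\hat{s}_{m^*}, \hat{s}_{m\vee m^*}) \le 16[\mathrm{pen}(m^*) - \mathrm{pen}(m)].$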
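Moreover

$H^2(\hat{s}_m, \hat{s}_{m\vee m^*}) - H^2(\hat{s}_{m^*}, \hat{s}_{m\vee m^*})$
$\quad = \int \left[\left(\sqrt{\hat{s}_m} - \sqrt{\hat{s}_{m\vee m^*}}\right)^2 - \left(\sqrt{\hat{s}_{m^*}} - \sqrt{\hat{s}_{m\vee m^*}}\right)^2\right] d\lambda$
$\quad = H^2(\hat{s}_m, \hat{s}_{m^*}) + 2\int \left(\sqrt{\hat{s}_{m^*}} - \sqrt{\hat{s}_m}\right)\left(\sqrt{\hat{s}_{m\vee m^*}} - \sqrt{\hat{s}_{m^*}}\right) d\lambda,$

hence, by (38) and Cauchy-Schwarz Inequality,

$H^2(\hat{s}_m, \hat{s}_{m^*})$
$\quad \le 16[\mathrm{pen}(m^*) - \mathrm{pen}(m)] + 2\int \left(\sqrt{\hat{s}_m} - \sqrt{\hat{s}_{m^*}}\right)\left(\sqrt{\hat{s}_{m\vee m^*}} - \sqrt{\hat{s}_{m^*}}\right) d\lambda$
$\quad \le 16[\mathrm{pen}(m^*) - \mathrm{pen}(m)] + 2H(\hat{s}_m, \hat{s}_{m^*})H(\hat{s}_{m\vee m^*}, \hat{s}_{m^*})$
$\quad \le 16[\mathrm{pen}(m^*) - \mathrm{pen}(m)] + \frac{1}{2}H^2(\hat{s}_m, \hat{s}_{m^*}) + 4H^2(\hat{s}_{m\vee m^*}, \hat{s}_{m^*}).$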
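Therefore, for any $m \in \mathcal{M}$ such that $T_{m,m^*} \le 0$,

$H^2(\hat{s}_m, \hat{s}_{m^*}) \le 8H^2(\hat{s}_{m\vee m^*}, \hat{s}_{m^*}) + 32[\mathrm{pen}(m^*) - \mathrm{pen}(m)]$

and, since

$H^2(\hat{s}_{m\vee m^*}, \hat{s}_{m^*}) \le 4\left[H^2(\hat{s}_{m\vee m^*}, \bar{s}_{m\vee m^*}) + H^2(\bar{s}_{m\vee m^*}, s) + H^2(s, \bar{s}_{m^*}) + H^2(\bar{s}_{m^*}, \hat{s}_{m^*})\right],$

then

(39) $(1/32)H^2(\hat{s}_m, \hat{s}_{m^*}) \le H^2(\hat{s}_{m\vee m^*}, \bar{s}_{m\vee m^*}) + H^2(\hat{s}_{m^*}, \bar{s}_{m^*}) + \mathrm{pen}(m^*) - \mathrm{pen}(m) + H^2(\bar{s}_{m\vee m^*}, s) + H^2(s, \bar{s}_{m^*}).$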
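Let us set, for all $z > 0$,

$\Omega_z = \bigcap_{(m,m')\in\mathcal{M}^2} \left\{\omega \in \Omega \;\middle|\; H^2(\hat{s}_{m\vee m'}, \bar{s}_{m\vee m'}) \le c|m \vee m'| + b[\Delta_m + \Delta_{m'} + z]\right\}.$

It follows from (34) that

(40) $\mathbb{P}[\Omega_z^c] \le \sum_{(m,m')\in\mathcal{M}^2} a e^{-z} e^{-\Delta_m - \Delta_{m'}} = \Sigma^2 a e^{-z}.$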

Let now $\omega$ belong to $\Omega_z$. It then follows that

(41) $H^2(\hat{s}_{m^*}, \bar{s}_{m^*}) \le c|m^*| + 2b\Delta_{m^*} + bz \le 2\,\mathrm{pen}(m^*) + bz$

and, using Assumption H, that

$H^2(\hat{s}_{m\vee m^*}, \bar{s}_{m\vee m^*}) \le c\delta[|m| + |m^*|] + b[\Delta_m + \Delta_{m^*} + z].$

Therefore we derive from (39) and (35) that, for all $m \in \mathcal{M}$ such that $T_{m,m^*} \le 0$,

$(1/32)H^2(\hat{s}_m, \hat{s}_{m^*}) \le H^2(\bar{s}_{m\vee m^*}, s) + H^2(s, \bar{s}_{m^*}) + (1 + \delta)c|m^*| + 3b\Delta_{m^*} + 2bz + \mathrm{pen}(m^*)$
$\qquad \le H^2(\bar{s}_{m\vee m^*}, s) + H^2(s, \bar{s}_{m^*}) + 2bz + 4\,\mathrm{pen}(m^*).$

In order to control the bias terms $H^2(s, \bar{s}_{m'})$ of the various estimators involved in
the construction of $\hat{s}$, we shall use Lemma 2 below. Since $S_{m'\vee m^*} \supset S_{m^*}$ for all
$m' \in \mathcal{M}$, this lemma implies that

$H^2(\bar{s}_{m'\vee m^*}, s) \le 2H^2(s, S_{m'\vee m^*}) \le 2H^2(s, S_{m^*}),$

therefore

$H^2(\hat{s}_m, \hat{s}_{m^*}) \le 128\left[H^2(s, S_{m^*}) + \mathrm{pen}(m^*) + bz/2\right],$

for all $m \in \mathcal{M}$ such that $T_{m,m^*} \le 0$ and we conclude from (37) and the definition of
$\mathcal{D}$ that, if $\omega \in \Omega_z$,

$H^2(\hat{s}_{\hat{m}}, \hat{s}_{m^*}) \le \mathcal{D}(m^*) + \varepsilon/3 \le 128\left[H^2(s, S_{m^*}) + \mathrm{pen}(m^*) + bz/2\right] + \varepsilon/3.$

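Since

$H^2(\hat{s}_{\hat{m}}, s) \le 3\left[H^2(\hat{s}_{\hat{m}}, \hat{s}_{m^*}) + H^2(\hat{s}_{m^*}, \bar{s}_{m^*}) + H^2(\bar{s}_{m^*}, s)\right],$

it follows from (41) and Lemma 2 that

$H^2(\hat{s}_{\hat{m}}, s) \le 3\left[130H^2(s, S_{m^*}) + 130\,\mathrm{pen}(m^*) + 65bz + \varepsilon/3\right].$

Since $m^*$ is arbitrary in $\mathcal{M}$ we finally get

$H^2(\hat{s}_{\hat{m}}, s)\,\mathbb{1}_{\Omega_z} \le 390\left(\inf_{m\in\mathcal{M}}\left[H^2(s, S_m) + \mathrm{pen}(m)\right] + bz/2\right) + \varepsilon.$

An integration with respect to $z$ taking (40) into account leads to (36). □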
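
The integration step may be spelled out as follows. This is only a sketch of the computation; the shorthand $V$ is introduced here for convenience and is not part of the original notation.

```latex
% Sketch: V is a shorthand used only in this computation.
\[
V = H^2(\hat{s}_{\hat{m}}, s)
    - 390 \inf_{m\in\mathcal{M}}\bigl[H^2(s,S_m) + \mathrm{pen}(m)\bigr] - \varepsilon ,
\qquad V \le 195\, b z \ \text{ on } \Omega_z .
\]
\[
\mathbb{E}[V] \le \mathbb{E}[V_+]
 = \int_0^{+\infty} \mathbb{P}[V > t]\,dt
 = 195\, b \int_0^{+\infty} \mathbb{P}[V > 195\, b z]\,dz
 \le 195\, b \int_0^{+\infty} \mathbb{P}[\Omega_z^c]\,dz
 \le 195\, a b \Sigma^2 = 390\,\frac{ab\Sigma^2}{2}.
\]
```

The last inequality uses the exponential bound (40) on the probability of the complement of $\Omega_z$.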
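Lemma 2. Within the framework of Section 2.1, for any $f \in C$, we have

$H^2(f, \bar{f}_m) \le 2H^2(f, S_m)$ with $\bar{f}_m = \sum_{I\in m\cap\mathcal{J}} \left(\frac{\int_I f\,d\lambda}{\lambda(I)}\right)\mathbb{1}_I.$

Proof. Let $\mathcal{X}' = \bigcup_{I\in m\cap\mathcal{J}} I$. Note that M is a finite measure on $\mathcal{X}'$ and that for all
$t \in S_m$,

$H^2(f, t) = H^2(f\mathbb{1}_{\mathcal{X}'}, t) + \int_{\mathcal{X}\setminus\mathcal{X}'} f\,d\lambda.$

It is therefore enough to show the result for $\mathcal{X}'$ in place of $\mathcal{X}$ and $f\mathbb{1}_{\mathcal{X}'}$ in place of
$f$ and we can restrict ourselves to the case where M is a finite measure on $\mathcal{X}$. Let
$\sqrt{\tilde{f}}$ be the $\mathbb{L}_2(\mathcal{X}, d\lambda)$ projection of $\sqrt{f}$ on $S_m$. Since the value of $\sqrt{\tilde{f}}$ on $I$ is given
by $\int_I \sqrt{f}\,d\lambda/\lambda(I)$, it suffices to prove that for each $I \in m \cap \mathcal{J}$

(42) $\int_I \left(\sqrt{f} - \sqrt{\frac{\int_I f\,d\lambda}{\lambda(I)}}\right)^2 d\lambda \le 2\int_I \left(\sqrt{f} - \frac{\int_I \sqrt{f}\,d\lambda}{\lambda(I)}\right)^2 d\lambda.$

By homogeneity, we may assume that $\lambda(I) = 1$. Expanding the left-hand side of (42)
we get

$\int_I \left(\sqrt{f} - \sqrt{\int_I f\,d\lambda}\right)^2 d\lambda = 2\int_I f\,d\lambda - 2\sqrt{\int_I f\,d\lambda}\int_I \sqrt{f}\,d\lambda,$

which, together with the inequality $\sqrt{\int_I f\,d\lambda} \ge \int_I \sqrt{f}\,d\lambda$ and the fact that the
right-hand side of (42) equals $2\int_I f\,d\lambda - 2\left(\int_I \sqrt{f}\,d\lambda\right)^2$, leads to the desired result.

□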
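
The inequality of Lemma 2 can be checked numerically on a finite space. The sketch below is only an illustration under our own assumptions: the measure is replaced by positive weights on finitely many points, the partition is a list of index blocks, and the names `hellinger2` and `lemma2_sides` are hypothetical helpers, not objects from the paper.

```python
import math
import random

def hellinger2(f, g, weights):
    """Weighted sum of (sqrt(f) - sqrt(g))^2, a discrete analogue of H^2."""
    return sum(w * (math.sqrt(a) - math.sqrt(b)) ** 2
               for a, b, w in zip(f, g, weights))

def lemma2_sides(f, weights, blocks):
    """Return (H^2(f, f_bar), H^2(f, S_m)) for a partition given as index blocks."""
    f_bar = [0.0] * len(f)
    best = [0.0] * len(f)
    for block in blocks:
        mass = sum(weights[i] for i in block)
        mean_f = sum(weights[i] * f[i] for i in block) / mass
        mean_root = sum(weights[i] * math.sqrt(f[i]) for i in block) / mass
        for i in block:
            f_bar[i] = mean_f          # histogram of f on the block
            best[i] = mean_root ** 2   # minimiser of the distance over S_m
    return hellinger2(f, f_bar, weights), hellinger2(f, best, weights)

# Assumed toy data: 12 points, random weights, three blocks.
random.seed(0)
f = [random.expovariate(1.0) for _ in range(12)]
weights = [random.uniform(0.5, 2.0) for _ in range(12)]
blocks = [[0, 1, 2], [3, 4, 5, 6, 7], [8, 9, 10, 11]]
lhs, rhs = lemma2_sides(f, weights, blocks)
print(lhs, 2 * rhs, lhs <= 2 * rhs)
```

The second returned value uses the fact, recalled in the proof, that the best approximation of $\sqrt{f}$ by a piecewise constant function takes the average of $\sqrt{f}$ on each block.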
7. Proofs



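7.1. Proof of Lemma 1. Let $m = m_p \vee \mathcal{K}_j$ and $m' = m_{p'} \vee \mathcal{K}_{j'}$ be two elements
of $\mathcal{M}$ and $I_p = [0,1)^k \setminus \bigcup_{I\in p} I$, $I_{p'} = [0,1)^k \setminus \bigcup_{I'\in p'} I'$. Assuming, with no loss of
generality, that $j \ge j'$, we get

$m \vee m' = m_p \vee m_{p'} \vee \mathcal{K}_j \vee \mathcal{K}_{j'} = m_p \vee m_{p'} \vee \mathcal{K}_j = m_1 \cup m_2 \cup m_3 \cup m_4,$

with

$m_1 = \{K \cap I \cap I' \ne \emptyset \mid K \in \mathcal{K}_j,\ I \in p,\ I' \in p'\};$

$m_2 = \{K \cap I \cap I_{p'} \ne \emptyset \mid K \in \mathcal{K}_j,\ I \in p\};$

$m_3 = \{K \cap I_p \cap I' \ne \emptyset \mid K \in \mathcal{K}_j,\ I' \in p'\};$

$m_4 = \{K \cap I_p \cap I_{p'} \ne \emptyset \mid K \in \mathcal{K}_j\}.$

Since $j \le J(p)$, hence $p \subset \bigcup_{l\ge j}\mathcal{K}_l$, for $K \in \mathcal{K}_j$ and $I \in p$, $K \cap I$ is either $I$ or $\emptyset$,
so that $m = p \cup p_j$ with $p_j = \{K \cap I_p \ne \emptyset,\ K \in \mathcal{K}_j\}$ and $|m| = |p| + |p_j|$. It also
follows that $|m_1| \le |p| + |p'|$ and $|m_2| \le |p|$. Then, given $K \in \mathcal{K}_j$ and $I' \in p'$, $K \cap I'$
is either $K$ or $I'$ or $\emptyset$ since $K, I' \in \mathcal{K}$, so that $|m_3| \le |p_j| + |p'|$. Finally $|m_4| \le |p_j|$
and

$|m \vee m'| \le 2\left[|p| + |p_j| + |p'|\right] \le 2(|m| + |m'|).$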



7.2. Some large deviations inequalities. The proofs of Theorems QE1 HI and El 
require to check (34) for each specific framework. Since

(43) $H^2(\hat{s}_m, \bar{s}_m) = \sum_{I\in m\cap\mathcal{J}} \left(\sqrt{N(I)} - \sqrt{\int_I s\,d\lambda}\right)^2$ for all $m \in \mathcal{M}$,

this amounts to proving some deviation results for quantities of the form

$\sum_{I\in m} \left(\sqrt{X_I} - \sqrt{\mathbb{E}[X_I]}\right)^2,$

which is the purpose of this section. Throughout it, we consider a finite set of
non-negative random variables $X_I$ with $I \in m$ and the related quantities

(44) $\chi^2(m) = \sum_{I\in m} \left(\sqrt{X_I} - \sqrt{\mathbb{E}[X_I]}\right)^2,$

the notation suggesting that these variables behave roughly like $\chi^2$ random variables
as we shall see. Our purpose will be to derive deviation bounds for those variables
from their expectation. Our first result is as follows:

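Theorem 7. Let $(X_I)_{I\in m}$ be a finite set of independent non-negative random variables
and $\chi^2(m)$ be given by (44). We assume that there exist $\kappa > 0$ and $\tau > 0$
such that

(45) $\log\left(\mathbb{E}\left[e^{z(X_I - \mathbb{E}[X_I])}\right]\right) \le \kappa\,\frac{z^2\,\mathbb{E}[X_I]}{2(1 - z\tau)}$ for all $z \in [0, 1/\tau[$

and

(46) $\log\left(\mathbb{E}\left[e^{-z(X_I - \mathbb{E}[X_I])}\right]\right) \le \kappa\,\frac{z^2\,\mathbb{E}[X_I]}{2}$ for all $z > 0$.

Let

$K = \max\left\{\sqrt{2},\ \frac{1}{\sqrt{2}} + \sqrt{\left(\frac{\tau}{\kappa} - \frac{1}{2}\right)_+}\right\}.$

Then for all $x > 0$,

(47) $\mathbb{P}\left[\chi^2(m) \ge \mathbb{E}\left[\chi^2(m)\right] + K^2\kappa\left(2\sqrt{2|m|x} + x\right)\right] \le e^{-x},$

(48) $\mathbb{P}\left[\chi^2(m) \le \mathbb{E}\left[\chi^2(m)\right] - 2K^2\kappa\sqrt{2|m|x}\right] \le e^{-x}.$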

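Proof. Let us first introduce the following large deviation result, the proof of which
follows the lines of the proof of Lemma 8 of Birgé and Massart (1998).

Lemma 3. Let $Y_1, \ldots, Y_n$ be $n$ independent, centered random variables. If

$\log\left(\mathbb{E}\left[e^{zY_i}\right]\right) \le \kappa\,\frac{z^2\theta_i}{2(1 - z\tau)}$ for all $z \in [0, 1/\tau[$ and $1 \le i \le n$,

then

$\mathbb{P}\left[\sum_{i=1}^n Y_i \ge \left(2\kappa x\sum_{i=1}^n \theta_i\right)^{1/2} + \tau x\right] \le e^{-x}$ for all $x > 0$.

If for $1 \le i \le n$ and all $z > 0$, $\log\left(\mathbb{E}\left[e^{-zY_i}\right]\right) \le \kappa z^2\theta_i/2$, then

$\mathbb{P}\left[\sum_{i=1}^n Y_i \le -\left(2\kappa x\sum_{i=1}^n \theta_i\right)^{1/2}\right] \le e^{-x}$ for all $x > 0$.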

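It follows from (45), (46) and Lemma 3 with $n = 1$, $Y_1 = X_I - \mathbb{E}[X_I]$ and
$\theta_1 = \mathbb{E}[X_I]$ that, for all $x > 0$ and $I \in m$,

$\mathbb{P}\left[X_I \ge \mathbb{E}[X_I] + \sqrt{2\kappa\,\mathbb{E}[X_I]\,x} + \tau x\right] \le e^{-x}$

and

$\mathbb{P}\left[X_I \le \mathbb{E}[X_I] - \sqrt{2\kappa\,\mathbb{E}[X_I]\,x}\right] \le e^{-x}.$

Setting $u = \mathbb{E}[X_I]/(\kappa x)$, we deduce that, with probability not smaller than $1 - 2e^{-x}$,

$\left|\sqrt{X_I} - \sqrt{\mathbb{E}[X_I]}\right|$
$\quad \le \max\left\{\sqrt{\mathbb{E}[X_I]} - \sqrt{\left(\mathbb{E}[X_I] - \sqrt{2\kappa\,\mathbb{E}[X_I]\,x}\right)_+}\ ;\ \sqrt{\mathbb{E}[X_I] + \sqrt{2\kappa\,\mathbb{E}[X_I]\,x} + \tau x} - \sqrt{\mathbb{E}[X_I]}\right\}$
$\quad = \sqrt{\kappa x}\,\max\left\{\sqrt{u} - \sqrt{\left(u - \sqrt{2u}\right)_+}\ ;\ \sqrt{u + \sqrt{2u} + (\tau/\kappa)} - \sqrt{u}\right\}$
$\quad \le \sqrt{\kappa x}\,\sup_{z>0}\max\left\{\sqrt{z} - \sqrt{\left(z - \sqrt{2z}\right)_+}\ ;\ \sqrt{z + \sqrt{2z} + (\tau/\kappa)} - \sqrt{z}\right\}.$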
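On the one hand, note that $z \mapsto \sqrt{z} - \sqrt{(z - \sqrt{2z})_+}$ admits a maximum equal to
$\sqrt{2}$ for $z = 2$. On the other hand, using the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ which
holds for all positive numbers $a$, $b$, we obtain for all $z > 0$,

$\sqrt{z + \sqrt{2z} + (\tau/\kappa)} - \sqrt{z} \le \sqrt{\left(\sqrt{z} + \frac{1}{\sqrt{2}}\right)^2 + \left(\frac{\tau}{\kappa} - \frac{1}{2}\right)_+} - \sqrt{z} \le \frac{1}{\sqrt{2}} + \sqrt{\left(\frac{\tau}{\kappa} - \frac{1}{2}\right)_+}$

and therefore

$\left|\sqrt{X_I} - \sqrt{\mathbb{E}[X_I]}\right| \le K\sqrt{\kappa x}$ with probability not smaller than $1 - 2e^{-x}$

or equivalently

(49) $\mathbb{P}\left[U_I \ge K^2x\right] \le 2e^{-x}$ for all $x > 0$ with $U_I = \kappa^{-1}\left(\sqrt{X_I} - \sqrt{\mathbb{E}[X_I]}\right)^2.$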
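
The two elementary bounds used just above can be checked on a grid. The following is a hedged numerical sketch, not part of the proof: `ratio` stands for the quotient $\tau/\kappa$, and the function names are ours.

```python
import math

def lower_gap(z):
    """sqrt(z) - sqrt((z - sqrt(2z))_+), the first term inside the max."""
    return math.sqrt(z) - math.sqrt(max(z - math.sqrt(2 * z), 0.0))

def upper_gap(z, ratio):
    """sqrt(z + sqrt(2z) + ratio) - sqrt(z), with ratio standing for tau/kappa."""
    return math.sqrt(z + math.sqrt(2 * z) + ratio) - math.sqrt(z)

def upper_gap_bound(ratio):
    """Bound obtained by completing the square and using sqrt(a+b) <= sqrt(a)+sqrt(b)."""
    return 1 / math.sqrt(2) + math.sqrt(max(ratio - 0.5, 0.0))

# Logarithmic grid of z values between 1e-4 and 1e4 (an arbitrary choice).
grid = [10 ** (k / 20) for k in range(-80, 81)]
print(max(lower_gap(z) for z in grid), lower_gap(2.0), math.sqrt(2))
for ratio in (0.0, 0.5, 1.0, 3.0):
    print(ratio, max(upper_gap(z, ratio) for z in grid), upper_gap_bound(ratio))
```

On such a grid the first quantity stays below $\sqrt{2}$, the value taken at $z = 2$, and the second stays below the bound for each tested ratio.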



Since $\chi^2(m) = \kappa\sum_{I\in m} U_I$ and the random variables $U_I$, $I \in m$ are independent,
(47) will derive from Lemma 3 if we show, setting $E_I = \mathbb{E}[U_I]$, that

(50) $\log\mathbb{E}\left[e^{z(U_I - E_I)}\right] \le \frac{4K^4z^2}{2(1 - K^2z)}$ for all $z \in\ ]0, 1/K^2[$.

Similarly, (48) will follow from

(51) $\log\mathbb{E}\left[e^{-z(U_I - E_I)}\right] \le \frac{4K^4z^2}{2}$ for all $z > 0$.



To prove (50), we shall use the following lemma about the centered moments of
positive random variables.

Lemma 4. Let $Z$ be a non-negative random variable. For any positive even integer
$k$,

$\mathbb{E}\left[(Z - \mathbb{E}[Z])^k\right] \le \mathbb{E}\left[Z^k\right] - (\mathbb{E}[Z])^k \le \mathbb{E}\left[Z^k\right].$

Note that the inequality $\mathbb{E}\left[(Z - \mathbb{E}[Z])^k\right] \le \mathbb{E}\left[Z^k\right]$ also holds true for odd integers
$k$ since $\mathbb{E}[Z] \ge 0$ and the map $z \mapsto z^k$ is then increasing.

Proof. Since the result is trivial for $k = 2$, we may assume that $k \ge 4$ and, using
homogeneity, that $\mathbb{E}[Z] = 1$. Consider the function $z \mapsto Q(z) = z^k - (z - 1)^k - k(z - 1)$
on $[0, +\infty[$. Its second derivative is negative for $z < 1/2$ and positive for $z > 1/2$,
from which we easily derive that $Q$ has a minimum for $z = 1$. This shows that
$Q(z) \ge 1$ for all $z \ge 0$ and consequently,

$\mathbb{E}\left[Z^k\right] - \mathbb{E}\left[(Z - 1)^k\right] = \mathbb{E}[Q(Z)] \ge Q(1) = 1,$

which leads to the result.

□
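
A quick numerical illustration of Lemma 4 and of the auxiliary function $Q$ is given below. It is only a sanity check on assumed toy data (the empirical distribution of a simulated exponential sample); the helper names are ours.

```python
import random

def Q(z, k):
    """Q(z) = z^k - (z - 1)^k - k (z - 1), the auxiliary function of the proof."""
    return z ** k - (z - 1) ** k - k * (z - 1)

def moment_sides(sample, k):
    """Return (E[(Z - E Z)^k], E[Z^k] - (E Z)^k) for the empirical law of sample."""
    n = len(sample)
    mean = sum(sample) / n
    central = sum((v - mean) ** k for v in sample) / n
    raw = sum(v ** k for v in sample) / n
    return central, raw - mean ** k

# Assumed toy data: a non-negative sample drawn from an exponential law.
random.seed(0)
sample = [random.expovariate(1.0) for _ in range(1000)]
for k in (2, 4, 6):
    central, gap = moment_sides(sample, k)
    print(k, central, gap)

# Smallest value of Q on a grid of [0, 5] for a few even k; the proof predicts Q >= 1.
for k in (4, 6, 8):
    print(k, min(Q(i / 100, k) for i in range(0, 501)))
```

Since the empirical distribution of a non-negative sample is itself the law of a non-negative random variable, the lemma applies to it directly.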



The random variable $U_I$ is positive and by (49) satisfies $\mathbb{P}[U_I \ge t] \le 2e^{-t/K^2}$.
Consequently, we deduce from the previous lemma (with $Z = U_I$) that for all integers
$k$ (odd or even)

(52) $\mathbb{E}\left[(U_I - E_I)^k\right] \le \mathbb{E}\left[U_I^k\right] = \int_0^{+\infty} kt^{k-1}\,\mathbb{P}[U_I \ge t]\,dt \le 2(k!)K^{2k}.$
Hence, for all $z \in\ ]0, 1/K^2[$,

$\log\mathbb{E}\left[e^{z(U_I - E_I)}\right] \le \log\left(1 + \sum_{k\ge 2} \frac{z^k}{k!}\,\mathbb{E}\left[(U_I - E_I)^k\right]\right) \le 2\sum_{k\ge 2} z^kK^{2k} = \frac{4K^4z^2}{2(1 - K^2z)}.$



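To prove (51), note that, for all $z, u > 0$, $e^{-zu} \le 1 - zu + z^2u^2/2$. Therefore, by
(52),

$\log\mathbb{E}\left[e^{-z(U_I - E_I)}\right] = \log\left(\mathbb{E}\left[e^{-zU_I}\right]\right) + zE_I \le \frac{z^2}{2}\,\mathbb{E}\left[U_I^2\right] \le \frac{4K^4z^2}{2},$

which completes the proof of Theorem 7. □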

A second pair of deviation inequalities for variables of the form $\chi^2(m)$ is as follows.

Theorem 8. Let $m$ be a finite index set and $X_j = (X_{I,j})_{I\in m}$, $1 \le j \le p$ be i.i.d.
random vectors with values in $\mathbb{R}_+^{|m|}$. Assume that there exist positive numbers $A$ and
$\kappa$ such that

(53) $\sum_{I\in m} X_{I,1} \le A$ a.s. and $\mathrm{Var}(X_{I,1}) \le \kappa\,\mathbb{E}[X_{I,1}]$ for all $I \in m$.

If $X_I = \sum_{j=1}^p X_{I,j}$ for all $I \in m$ and $\chi^2(m)$ is given by (44), then

(54) $\mathbb{P}\left[\chi^2(m) \ge 8\kappa|m| + 202Ax\right] \le e^{-x}$ for all $x > 0$.

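Proof. Since $X_I = 0$ a.s. if $\mathbb{E}[X_{I,1}] = 0$, we may remove all indexes $I$ such that
$\mathbb{E}[X_{I,1}] = 0$ in the sum and therefore assume that $\mathbb{E}[X_I] = p\,\mathbb{E}[X_{I,1}] > 0$ for all
$I \in m$. We can then write, for all $z > 0$,

$\mathbb{P}\left[\sqrt{\chi^2(m)} \ge z\right]$
$\quad = \mathbb{P}\left[\sum_{I\in m} \frac{\left(\sqrt{X_I} - \sqrt{\mathbb{E}[X_I]}\right)^2}{\sqrt{\chi^2(m)}} \ge z,\ \sqrt{\chi^2(m)} \ge z\right]$
$\quad = \mathbb{P}\left[\sum_{I\in m} \frac{\sqrt{X_I} - \sqrt{\mathbb{E}[X_I]}}{\sqrt{\chi^2(m)}\left(\sqrt{X_I} + \sqrt{\mathbb{E}[X_I]}\right)}\,\left(X_I - \mathbb{E}[X_I]\right) \ge z,\ \sqrt{\chi^2(m)} \ge z\right]$
$\quad = \mathbb{P}\left[\sum_{j=1}^p \sum_{I\in m} \frac{\left(\sqrt{X_I} - \sqrt{\mathbb{E}[X_I]}\right)\sqrt{\mathbb{E}[X_I]}}{\sqrt{\chi^2(m)}\left(\sqrt{X_I} + \sqrt{\mathbb{E}[X_I]}\right)}\,\frac{X_{I,j} - \mathbb{E}[X_{I,j}]}{\sqrt{\mathbb{E}[X_I]}} \ge z,\ \sqrt{\chi^2(m)} \ge z\right]$
$\quad = \mathbb{P}\left[\sum_{j=1}^p \sum_{I\in m} t_I\,\frac{X_{I,j} - \mathbb{E}[X_{I,j}]}{\sqrt{\mathbb{E}[X_I]}} \ge z,\ \sqrt{\chi^2(m)} \ge z\right],$

where

$t_I = \frac{\sqrt{\mathbb{E}[X_I]}\left(\sqrt{X_I} - \sqrt{\mathbb{E}[X_I]}\right)}{\sqrt{\chi^2(m)}\left(\sqrt{X_I} + \sqrt{\mathbb{E}[X_I]}\right)}$ for all $I \in m$.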
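Note that $\sum_{I\in m} t_I^2 \le 1$ since $\sqrt{\mathbb{E}[X_I]}/\left(\sqrt{X_I} + \sqrt{\mathbb{E}[X_I]}\right) \le 1$ and that
$|t_I| \le z^{-1}\sqrt{\mathbb{E}[X_I]}$ on the set $\sqrt{\chi^2(m)} \ge z$, from which we deduce that

(55) $\mathbb{P}\left(\sqrt{\chi^2(m)} \ge z\right) \le \mathbb{P}\left[\sup_{t\in\mathcal{T}} \sum_{j=1}^p \sum_{I\in m} t_I\,\frac{X_{I,j} - \mathbb{E}[X_{I,j}]}{\sqrt{\mathbb{E}[X_I]}} \ge z\right],$

where $\mathcal{T}$ denotes the set of vectors $t = (t_I)_{I\in m} \in \mathbb{R}^{|m|}$ satisfying

(56) $|t_I| \le z^{-1}\sqrt{\mathbb{E}[X_I]}$ for all $I \in m$ and $\sum_{I\in m} t_I^2 \le 1$.

In order to bound the right-hand side of (55), we shall use the following result from
Massart (2000, Theorem 2.4).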

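Theorem 9. Let $\xi_1, \ldots, \xi_n$ be independent random variables with values in some
measurable space $\mathcal{H}$ and $\mathcal{F}$ be some countable family of real valued measurable
functions on $\mathcal{H}$ such that $\|f\|_\infty \le b < +\infty$ for all $f \in \mathcal{F}$. If

$Z = \sup_{f\in\mathcal{F}} \left|\sum_{j=1}^n \left(f(\xi_j) - \mathbb{E}[f(\xi_j)]\right)\right|$ and $\sigma^2 = \sup_{f\in\mathcal{F}} \sum_{j=1}^n \mathrm{Var}(f(\xi_j)),$

then for every positive numbers $\varepsilon$, $x$

$\mathbb{P}\left[Z \ge (1 + \varepsilon)\,\mathbb{E}[Z] + 2\sigma\sqrt{2x} + \left(2.5 + 32\varepsilon^{-1}\right)bx\right] \le e^{-x}.$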

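We want to apply this result to the vectors $\xi_j \in \mathbb{R}^{|m|}$ with coordinates $\xi_{I,j} =
(X_{I,j} - \mathbb{E}[X_{I,j}])/\sqrt{\mathbb{E}[X_I]}$ for $I \in m$. Under our assumptions, these random vectors
are independent and satisfy

$\sum_{I\in m} \sqrt{\mathbb{E}[X_I]}\,|\xi_{I,j}| = \sum_{I\in m} \left|X_{I,j} - \mathbb{E}[X_{I,j}]\right| \le 2A.$

Consequently, the random vectors $\xi_j$ take their values in the subset $\mathcal{H}$ of $\mathbb{R}^{|m|}$ given
by

$\mathcal{H} = \left\{u = (u_I,\ I \in m)\ \middle|\ \sum_{I\in m} \sqrt{\mathbb{E}[X_I]}\,|u_I| \le 2A\right\}.$

For $u \in \mathcal{H}$ and $t \in \mathcal{T}$, we set $f_t(u) = \sum_{I\in m} t_Iu_I$ and $\mathcal{F} = \{f_t,\ t \in \mathcal{T}'\}$ where
$\mathcal{T}'$ denotes a countable and dense subset of $\mathcal{T}$. With no loss of generality we can
assume that $\mathcal{T}'$ is symmetric around 0 (if $t \in \mathcal{T}'$ then $-t \in \mathcal{T}'$) which implies that
the absolute values can be removed in the definition of $Z$. Since, for all $t \in \mathcal{T}$ and
$1 \le j \le p$, $f_t(\xi_j)$ is centered, we can finally write

$Z = \sup_{t\in\mathcal{T}'} \sum_{j=1}^p \sum_{I\in m} t_I\,\frac{X_{I,j} - \mathbb{E}[X_{I,j}]}{\sqrt{\mathbb{E}[X_I]}} = \sup_{t\in\mathcal{T}} \sum_{j=1}^p \sum_{I\in m} t_I\,\frac{X_{I,j} - \mathbb{E}[X_{I,j}]}{\sqrt{\mathbb{E}[X_I]}}.$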
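Using Cauchy-Schwarz Inequality and (56), we then derive that

$\mathbb{E}^2[Z] \le \mathbb{E}\left[Z^2\right] \le \sum_{I\in m} \mathbb{E}\left[\left(\sum_{j=1}^p \frac{X_{I,j} - \mathbb{E}[X_{I,j}]}{\sqrt{\mathbb{E}[X_I]}}\right)^2\right] = \sum_{I\in m} \sum_{j=1}^p \frac{\mathrm{Var}(X_{I,j})}{\mathbb{E}[X_I]}.$

Since $\mathrm{Var}(X_{I,j}) \le \kappa\,\mathbb{E}[X_{I,j}]$ and $\sum_{j=1}^p \mathbb{E}[X_{I,j}] = \mathbb{E}[X_I]$, we conclude that $\mathbb{E}[Z] \le
\sqrt{\kappa|m|}$. To bound $\|f_t\|_\infty$, we use (56) which implies that, for all $u \in \mathcal{H}$ and $t \in \mathcal{T}$,

$|f_t(u)| \le \sum_{I\in m} |t_I|\,|u_I| \le z^{-1}\sum_{I\in m} \sqrt{\mathbb{E}[X_I]}\,|u_I| \le \frac{2A}{z}.$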

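Finally, it follows from the equidistribution of the $X_j$, Cauchy-Schwarz Inequality,
(53) and (56) that, for all $t \in \mathcal{T}$,

$\sum_{j=1}^p \mathrm{Var}(f_t(\xi_j)) = p\,\mathrm{Var}(f_t(\xi_1)) = \mathbb{E}\left[\left(\sum_{I\in m} t_I\,\frac{X_{I,1} - \mathbb{E}[X_{I,1}]}{\sqrt{\mathbb{E}[X_{I,1}]}}\right)^2\right]$
$\quad \le 2\,\mathbb{E}\left[\left(\sum_{I\in m} t_I\,\frac{X_{I,1}}{\sqrt{\mathbb{E}[X_{I,1}]}}\right)^2\right] + 2\left(\sum_{I\in m} t_I\sqrt{\mathbb{E}[X_{I,1}]}\right)^2$
$\quad \le 2A\,\mathbb{E}\left[\sum_{I\in m} t_I^2\,\frac{X_{I,1}}{\mathbb{E}[X_{I,1}]}\right] + 2\sum_{I\in m} t_I^2 \sum_{I\in m} \mathbb{E}[X_{I,1}]$
$\quad \le 4A.$

In view of all these bounds, we may apply Theorem 9 with $\sigma^2 = 4A$, $b = 2A/z$ and
$\varepsilon = 1$ and obtain that $\mathbb{P}\left[\sqrt{\chi^2(m)} \ge z\right] \le e^{-x}$ as soon as $z \ge 2\sqrt{\kappa|m|} + 4\sqrt{2Ax} +
69Ax/z$. Solving this quadratic inequation and using $(a + b)^2 \le 2(a^2 + b^2)$, we can
check that this inequality holds if $z^2 \ge 8\kappa|m| + 202Ax$, hence the result. □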
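
The last step can also be checked numerically. The sketch below, with function names of our own choosing, evaluates the requirement $z \ge 2\sqrt{\kappa|m|} + 4\sqrt{2Ax} + 69Ax/z$ at $z = \sqrt{8\kappa|m| + 202Ax}$ over a small grid of assumed parameter values.

```python
import math

def required_level(z, kappa, size_m, A, x):
    """Right-hand side 2 sqrt(kappa |m|) + 4 sqrt(2 A x) + 69 A x / z."""
    return 2 * math.sqrt(kappa * size_m) + 4 * math.sqrt(2 * A * x) + 69 * A * x / z

def sufficient_z(kappa, size_m, A, x):
    """Candidate level z with z^2 = 8 kappa |m| + 202 A x."""
    return math.sqrt(8 * kappa * size_m + 202 * A * x)

# Arbitrary illustrative values of kappa, |m|, A and x.
for kappa in (0.5, 1.0, 3.0):
    for size_m in (1, 10, 1000):
        for A in (0.1, 1.0, 5.0):
            for x in (0.01, 1.0, 50.0):
                z = sufficient_z(kappa, size_m, A, x)
                slack = z - required_level(z, kappa, size_m, A, x)
                print(kappa, size_m, A, x, slack >= 0)
```

Because the map $z \mapsto z - 69Ax/z$ is increasing on $]0, +\infty[$, any larger $z$ also satisfies the requirement.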



7.3. Density estimation. 

7.3.1. Proof of Theorem^. For two given classes $m, m' \in \mathcal{M}$, we apply Theorem 8
with $m'' = m \vee m'$ in place of $m$, $p = n$ and $X_{I,j} = \mathbb{1}_{Y_j\in I}$ for all $I \in m''$ and
$j = 1, \ldots, n$. Then $X_I = nN(I)$ and (53) is satisfied with $A = \kappa = 1$ since $X_{I,1}$
is a Bernoulli random variable and we derive from (43) that, for all $x > 0$, with
probability not smaller than $1 - e^{-x}$,

$H^2(\hat{s}_{m''}, \bar{s}_{m''}) = \sum_{I\in m''} \left(\sqrt{N(I)} - \sqrt{\mathbb{E}[N(I)]}\right)^2 = \frac{\chi^2(m'')}{n} \le \frac{8|m''| + 202x}{n}.$

Therefore (34) holds with $c = 8/n$, $a = 1$ and $b = 202/n$. We then conclude from
Theorem 6 and the fact that $H^2(t, u)$ is always bounded by 2.

7.3.2. Proof of Proposition^ By assumption, y/s has a variation bounded by R and 
we may apply to it Corollary 1 of Barron, Birge and Massart (1999) with a = 1, 
D = 2 J with j > 2 and iV = 2 3:; . It follows that one can find m G Ai^j.D such that 
H 2 (s,S m ) < (64/3)(R/D) 2 . Since pen(m) < CjDn^ 1 for m G Msj t D, we derive 
from Theorem Q that 

E s [H 2 (s, s)] < C inf {i? 2 2- 2j + j^tT 1 } . 

Then ((11() follows if we define j > 2 by 

4^' +1 < [ni? 2 / log (1 + nR 2 )] ^ < 4^'+ 2 , 
which is always possible since $nR^2 > 0$, and distinguish between the cases $j = 2$
(which corresponds to $nR^2 < 26.519$) and $j > 2$.

When y/s is continuous with modulus w, there exists an element t G S mj such 
that — V^lloo < u>(2~ 3 ), hence H(s,S mj ) < w(2~ 3 ). Since > 0, we can 
choose j such that 2~- ? < x w < 2~ 3+1 . Recalling that pen(m 3 ) < C2-? /n, we deduce 
from Theorem n that 

E s [# 2 (s,s)] < C [w 2 {2~i) +n~ l 2i] < C [w 2 (x w ) + 2(nx w )- 1 ] < 3C'(nx w )-\ 

which proves (JUJ). If belongs to H% with i? > ra" 1 / 2 , then x w = (nR)' 2 ^ 2a+1 ^ 
and the risk bound follows. 

If s belongs to S 3 (D,R), we can write s = Ylk=i s k\ Xk _ 1 ,x k ) with = xo < x\ < 
... <x D = l and 8np x < k < D s k < R. Fix I such that 2 l > nR > 2 1 ' 1 . Then 2 l > 2D 
and for < k < D, set x' k = sup{x G Si \ x < x^} and t = Ylk=i s k\ x ' lt x') so that 
t G S m with m G M.i t D> with D' < D since some intervals [x' k _ 1 ,x' k ) may be empty. 
Then 

D-l 

# 2 (s,t) < ^^(xfc 



xi) < RD2~ l . 



k=l 



Recalling from © that pen(m) < Cn~ 1 [D(Zlog2 + 2-log.D) + 21ogZ] for m G M^d, 
we conclude from Theorem ^ © and our choice of I that 

E s [H 2 (s,s)] 

< C \rD2~ 1 + [D(l log 2 + 2 - log D)+2 log /]n~ ] 

< C'{D/n) Z + \og2 + \og(2 l - 1 /D}+2D- 1 \ogl 

< C'{D/n) [3 + log 2 + log (nR/D) + 2(D log 2)" 1 log log(2nfl)] 
and (|13|) follows since nR > 2D. 

7.4. Random vectors. 



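7.4.1. Proof of Theorem^. For two given elements $m, m' \in \mathcal{M}$, we apply Theorem 7
with $m'' = m \vee m'$ in place of $m$ and $X_I = N(I)$. We derive from the independence
of the $N_i$ that (45) and (46) hold. Therefore, for all $x > 0$, with probability not
smaller than $1 - e^{-x}$,

$H^2(\hat{s}_{m''}, \bar{s}_{m''}) = \sum_{I\in m''} \left(\sqrt{N(I)} - \sqrt{\mathbb{E}[N(I)]}\right)^2$
$\quad \le \mathbb{E}\left[\sum_{I\in m''} \left(\sqrt{N(I)} - \sqrt{\mathbb{E}[N(I)]}\right)^2\right] + K^2\kappa\left(2\sqrt{2|m''|x} + x\right).$

It follows from (45) that $\mathrm{Var}(N(I)) \le \kappa\,\mathbb{E}[N(I)]$ (expand both sides of (45) in a
vicinity of 0) and therefore

$\mathbb{E}\left[\sum_{I\in m''} \left(\sqrt{N(I)} - \sqrt{\mathbb{E}[N(I)]}\right)^2\right] \le \sum_{I\in m''} \mathbb{E}\left[\frac{\left(N(I) - \mathbb{E}[N(I)]\right)^2}{\mathbb{E}[N(I)]}\right] \le \kappa|m''|.$

Using the inequality $2\sqrt{2|m''|x} \le |m''| + 2x$ we conclude that, with probability not
smaller than $1 - e^{-x}$,

(57) $H^2(\hat{s}_{m''}, \bar{s}_{m''}) \le \left(1 + K^2\right)\kappa|m''| + 3K^2\kappa x.$

We derive that (34) is fulfilled with $c = \left(1 + K^2\right)\kappa$, $b = 3K^2\kappa$, $a = 1$ and the theorem
follows from Theorem 6.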



7.4.2. Proof of Proposition^ Let us first note that, if \m\ = n, then H 2 (s,S m ) = 0, 
hence by £8), E [i? 2 (s,s)] < C(k,K)u which proves the bound when R > n/vo. 
For the other cases, we deduce from Lemma below that, for any D E X, one can 
find some m <E M such that |m| < D and H 2 (s,S m ) < n(R/D) 2 . Setting D = 1, 
we get the result for the case -R 2 < logn. Finally, when n _1 log re < R 2 < n 2 /3 
we fix D = inf { j G N | j 3 > nR 2 / log(re/i?)}. Since the function R ^ R 2 / log(n/R) 
is increasing for R < re/\/3, 1 < D < n and the corresponding risk bound follows. 

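Lemma 5. Let $f$ be a nondecreasing function from $\mathcal{X} = \{1, \ldots, n\}$ to $\mathbb{R}_+$ such that
$\sqrt{f(n)} - \sqrt{f(1)} = R$. For $D \in \mathcal{X}$, one can find a partition $(I_1, \ldots, I_K)$ of $\mathcal{X}$ into
$K \le D$ intervals and a function $g$ from $\mathcal{X}$ to $\mathbb{R}_+$ of the form $g = \sum_{k=1}^K \beta_k\mathbb{1}_{I_k}$ such
that

$\sum_{i=1}^n \left(\sqrt{f(i)} - \sqrt{g(i)}\right)^2 \le nR^2D^{-2}.$

Proof: Let us set $j_0 = 1$ and define iteratively for $k \ge 1$, using the convention
$\inf\emptyset = n$,

(58) $j_k = \inf\left\{i \in \{j_{k-1} + 1, \ldots, n\}\ \middle|\ \sqrt{f(i)} - \sqrt{f(j_{k-1})} > R/D\right\}.$

Let $K = \inf\{k \ge 1,\ j_k = n\}$, $I_K = \{j_{K-1}, \ldots, n\}$ and for $k = 1, \ldots, K - 1$ (if
$K \ge 2$), $I_k = \{j_{k-1}, \ldots, j_k - 1\}$. This defines a partition of $\mathcal{X}$ with $K$ elements and
it follows from (58) that

$R = \sqrt{f(n)} - \sqrt{f(1)} \ge \sum_{k=1}^{K-1} \left(\sqrt{f(j_k)} - \sqrt{f(j_{k-1})}\right) > (K - 1)R/D,$

hence $K - 1 < D$ and $K \le D$. Let us now set $\beta_k = f(j_{k-1})$ for $1 \le k \le K$. Since
$\sqrt{f(j_k - 1)} - \sqrt{f(j_{k-1})} \le R/D$ we get for all $i \in I_k$, $0 \le \sqrt{f(i)} - \sqrt{\beta_k} \le R/D$.
Hence,

$\sum_{i=1}^n \left(\sqrt{f(i)} - \sqrt{g(i)}\right)^2 = \sum_{k=1}^K \sum_{i\in I_k} \left(\sqrt{f(i)} - \sqrt{\beta_k}\right)^2 \le nR^2D^{-2}.$ □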
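
The greedy construction behind Lemma 5 is easy to implement. The sketch below is a slight variant of the construction in the proof, written under our own conventions: indices start at 0, a new block is opened each time $\sqrt{f}$ has increased by more than $R/D$ since the start of the current block (so the last point may form its own block), and the names `greedy_partition` and `sqrt_error` are ours.

```python
import math

def greedy_partition(f, D):
    """Split the indices of a nondecreasing, non-negative list f into blocks.

    A new block starts whenever sqrt(f) exceeds its value at the start of the
    current block by more than R / D, where R = sqrt(f[-1]) - sqrt(f[0]).
    Returns the blocks and the piecewise constant function g equal to the
    value of f at the start of each block.
    """
    root = [math.sqrt(v) for v in f]
    step = (root[-1] - root[0]) / D
    blocks = [[0]]
    start = 0
    for i in range(1, len(f)):
        if root[i] - root[start] > step:
            blocks.append([i])
            start = i
        else:
            blocks[-1].append(i)
    g = [0.0] * len(f)
    for block in blocks:
        for i in block:
            g[i] = f[block[0]]
    return blocks, g

def sqrt_error(f, g):
    """Sum over i of (sqrt(f(i)) - sqrt(g(i)))^2."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(f, g))

# Toy example: f(i) = i^2 on {1, ..., 20}, so that sqrt(f) is linear and R = 19.
f = [i * i for i in range(1, 21)]
D = 4
blocks, g = greedy_partition(f, D)
R = math.sqrt(f[-1]) - math.sqrt(f[0])
print(len(blocks), sqrt_error(f, g), len(f) * R ** 2 / D ** 2)
```

Each block start lies more than $R/D$ above the previous one on the $\sqrt{f}$ scale, which limits the number of blocks to $D$, while inside a block $\sqrt{f}$ varies by at most $R/D$, which gives the error bound.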



7.5. Poisson and other counting processes. 



7.5.1. Poisson processes. The proof of Theor em follows the same lines as the proof 
of Theorem^. We apply Theorem 7 with $m'' = m \vee m'$ in place of $m$ and $X_I = N(I)$.
Since $\{N(I),\ I \in m''\}$ are independent Poisson random variables, the assumptions
of the theorem are fulfilled with $\kappa = \tau = 1$. We then proceed as in Section 7.4.1 to
get (57) with $K^2 = 2$ which provides the relevant values of $c$ and $b$.
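
That Poisson variables satisfy the two exponential moment conditions of Theorem 7 with both constants equal to 1 can be checked numerically. The sketch below relies on the standard formula $\log\mathbb{E}[e^{z(X - \mu)}] = \mu(e^z - 1 - z)$ for a Poisson variable $X$ with mean $\mu$; the function names and the grids are our own choices.

```python
import math

def poisson_centered_log_mgf(z, mu):
    """log E[exp(z (X - mu))] for X Poisson with mean mu."""
    return mu * (math.exp(z) - 1 - z)

def upper_condition_bound(z, mu, kappa=1.0, tau=1.0):
    """Bound kappa z^2 mu / (2 (1 - z tau)), valid for 0 <= z < 1 / tau."""
    return kappa * z * z * mu / (2 * (1 - z * tau))

def lower_condition_bound(z, mu, kappa=1.0):
    """Bound kappa z^2 mu / 2, used for the Laplace transform at -z."""
    return kappa * z * z * mu / 2

mu = 3.0  # arbitrary mean chosen for illustration
for z in (0.1, 0.5, 0.9):
    print(z, poisson_centered_log_mgf(z, mu), upper_condition_bound(z, mu))
for z in (0.1, 1.0, 10.0):
    print(z, poisson_centered_log_mgf(-z, mu), lower_condition_bound(z, mu))
```

Both comparisons follow from the series expansion of the exponential: $e^z - 1 - z \le z^2/(2(1 - z))$ for $0 \le z < 1$ and $e^{-z} - 1 + z \le z^2/2$ for $z \ge 0$.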
7.5.2. Proof of Theorem^. Let us fix two classes $m, m' \in \mathcal{M}$. We first apply Theorem 8
with $m'' = m \vee m'$ in place of $m$, $p = n$ and $X_{I,j} = N^j(I)$ for all $I \in m''$
and $j = 1, \ldots, n$. Then for all $I \in m''$, $N(I) = X_I$. Since $X_{I,j}$ is bounded by $k$,
$\mathbb{E}\left[X_{I,1}^2\right] \le k\,\mathbb{E}[X_{I,1}]$ and (53) holds with $A = \kappa = k$. This implies that, for all $x > 0$,
with probability not smaller than $1 - e^{-x}$,

(59) $\sum_{I\in m''} \left(\sqrt{N(I)} - \sqrt{\mathbb{E}[N(I)]}\right)^2 \le k\left(8|m''| + 202x\right).$



Then we apply once again Theorem 8 with $m'' = m \vee m'$ in place of $m$, $p = n$ and
$X_{I,j} = \int_I sY^j\,d\lambda$ for all $I \in m''$ and $j = 1, \ldots, n$. Since $Y^j$ is bounded by 1, the
assumptions of Theorem 8 are fulfilled with $A = \int_{\mathcal{X}} s\,d\lambda$ and $\kappa = \kappa'$. Consequently,
with probability not smaller than $1 - e^{-x}$,

(60) $\sum_{I\in m''} \left(\sqrt{X_I} - \sqrt{\mathbb{E}[X_I]}\right)^2 \le 8\kappa'|m''| + 202Ax.$

Since $\mathbb{E}\left[\int_I s\,d\lambda\right] = \mathbb{E}[N(I)]$, we derive from (59) and (60) that, with probability not
smaller than $1 - 2e^{-x}$,

$H^2(\hat{s}_{m''}, \bar{s}_{m''}) \le 16|m''|(k + \kappa') + 404x(k + A).$

This means that (34) holds with $c = 16(k + \kappa')$, $a = 2$ and $b = 404(k + A)$. Therefore,
if we set $\Delta_m = k(k + A)^{-1}\Delta'_m$ for all $m \in \mathcal{M}$, the condition on the weights holds with
$\Sigma = \Sigma'(k/(k + A))$ and $\mathrm{pen}(m) = 16\delta|m|(k + \kappa') + 404k\Delta'_m$. An application of
Theorem 6 leads to the result.



7.5.3. Proof of Proposition The following argument shows that © is satisfied: 
let A be some measurable subset of X and B be the subset of A given by B = 
{t G A | A([0,t] n A) = 0}. Since, by definition, the sets [0,t] n 5 with t € B are 
negligible, A(i?) = (write 1? as an at most countable union of those sets). Conse- 
quently, 

¥{N{A) > 0, M{A) = 0) < Y F (^HA) = 1, J \ >t dt = 0^ 

n 

3=1 
n 

< ^P(Tj£5) = 

3=1 
sd\ — ,
A? P[Ti>t] 



di = -log([P(2\ > 1)]) 



since —p(t) is the derivative of P[2\ > t]. Finally we can take «' = 2 since, whatever 
I (Z X, 



Var 



s(t)Y t *dt 



< E 



s(t)Y*dt 



E 



s(t)s(t')Y t *YJdtdt' 



Ixl 



Ixl 



Ixl 



s(t)s(t')E[Y t *Y t r] dtdt' 
s(t)s(t')F \t x > max{M'} 



2 / s (i) / l {t ,> t}S (t')P Ti > t' 



dtdt' 
dt! ] (it 



2 dt' 



< 2^ S (t)E J s(t')Y t 



2 y s(t)E [N 1 ^,!])] dt 



dt 



< 2 J s(t)F 



Ti > t 



dt = 2E 



s(t)Y t *dt 



7.5.4. Proof of Proposition^ Clearly Q29|) holds true. We now prove that Condition 
© is also fulfilled. Let A be some measurable subset of IR+ and for Z > 1 let B[ be 
the subset of A defined by 

Bi = {t £ a \ x(]t-r 1 ,t]nA) =o}. 

For each I > 1, note that the sets [t — l" 1 , t] n B[ C [f — i -1 ,t] H ^4 are negligible 
for t £ Bi and hence so is 5; (write 5; as an at most countable union of those). 
Denoting, for j = 1, . . . , n, the time of the jump of X 3 from state 1 to by T( Q , we 
have 

F(N(A) > 0, M(A) = 0) 

< jy(N*(A) = 1, jf l x i_ =1 <ft = o) 

n 

< J> (j\P(,4) = 1, 3e > 0, A (pZ* - £,Tf )0 ] ni)=0 

3=1 
n 

< E E p ( T i,o e a, a ([r£o - r 1 , 2^] n a) = o) 

3=1 2>1 

n n 

< jE p K efl| ) = EE 1 ^*™ = °> 

3=1 i>l 3=1 Z>1 






by (|32|) . We may clearly fix k = 1 and the choice of «/ is justified by the following 
argument. First note that whatever I <Z X and t > 

F(Xl = l, T^el, Tl >t) 

l {u > i} P (Xl = 1, u < Tlo < u + d«) 

l {u > i} P {XL = 1, = 1) P (u < Tl <u + du\ XL = 1, = 1) 
\ {u > t} F (XL = 1, = 1) P (n < Tin < u + du \ Xi_ = l) 



since X 1 is a Markov process. Hence 
P(X t 1 _ = l, Tl el, Tlo>t) -- 



It then follows that 



l {u>ty 



{XL = 1, XL = 1) s{u)du 



E 



\u>t} \x}_ =1} \xl_ =i } s(u)du 



Var (J s{t)Y?dt\ < E 



x 



E 



lj(t)lj(«)s(t)a(«)y t 1 3^ 1 d'urft 



XxX 



= 2 ^ E ^ 1 {«>t} ]1 {X t 1 _=l} ]1 {Xi_=l} s ( u )^ 

= 2jw{X}_ = 1, Tjo E /, > t) «(t)di 
< 2 Jf {X}_ = 1) s(t)dt = 2E J s{t)Y? 



s(t)dt 